Memcached Server Functionality in a Cluster of Data Processing Nodes

ABSTRACT

A method is performed by a first server on a chip (SoC) node that is one instance of a plurality of nodes within a cluster of nodes. An operation is performed for determining if a second one of the SoC nodes in the cluster has data stored thereon corresponding to a data identifier in response to receiving a data retrieval request including the data identifier. An operation is performed for determining if a remote memory access channel exists between the SoC node and the second one of the SoC nodes. An operation is performed for accessing the data from the second one of the SoC nodes using the remote memory access channel after determining that the second one of the SoC nodes has the data stored thereon and that the remote memory access channel exists between the SoC node and the second one of the SoC nodes.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. application Ser. No. 15/357,332, filed Nov. 21, 2016, which is a continuation of U.S. application Ser. No. 13/728,428, filed Dec. 27, 2012, which is a continuation-in-part of U.S. application Ser. No. 13/453,086 filed Apr. 23, 2012, which is a continuation-in-part of U.S. application Ser. No. 12/794,996 filed Jun. 7, 2010 which claims priority to U.S. Provisional Application No. 61/256,723 filed Oct. 30, 2009, all of these applications having a common applicant herewith and being incorporated herein in their entirety by reference.

BACKGROUND

1. Field of the Invention

The embodiments of the present invention relate to allocation and disassociation of disparate computing resources of clustered computing nodes. More specifically, embodiments of the present invention relate to systems and methods for providing memcached server functionality in a cluster of data processing nodes such as for allowing access to cached information from one or more data processing nodes within a cluster of data processing nodes.

2. Description of Related Art

Conventionally, network systems used different topologies, e.g. Ethernet architecture employed a spanning tree type of topology. Recently, Ethernet fabric topology has been developed that provides a higher level of performance, utilization, availability and simplicity. Such Ethernet fabric topologies are flatter and self-aggregating in part because of the use of intelligent switches in the fabric that are aware of the other switches and can find shortest paths without loops. One benefit is that Ethernet fabric topologies are scalable with high performance and reliability. Ethernet fabric data center architectures are available from Juniper, Avaya, Brocade, and Cisco.

A “shared nothing architecture” is a distributed computing architecture in which each node is independent and self-sufficient. Typically, none of the nodes share memory or disk storage. A shared nothing architecture is popular for web development because of its scalability. What is deficient in typical shared nothing clusters is the ability to allow memory capacity to be provisioned based on workload on a per-node basis, to implement memcached functionality on a per-node basis across a plurality of nodes in a cluster, to load/store from remote memory, to perform remote DMA transactions, and to perform remote interrupts.

SUMMARY

The system and method of the present invention provide flexible methods of extending these distributed network systems beyond the typical shared nothing cluster to accommodate different protocols in varying network topologies. The systems and methods hereof provide the ability to load/store from remote memory, implement memcached functionality on a per-node basis across a plurality of nodes in a cluster, perform remote DMA transactions, and perform remote interrupts, allowing a wide range of use cases that greatly extend performance, power optimization, and functionality of shared nothing clusters. Several examples are described which include network acceleration, storage acceleration, message acceleration, and shared memory windows across a power-optimized interconnect multi-protocol fabric.

In one embodiment, a method is performed by a first server on a chip (SoC) node that is one instance of a plurality of nodes within a cluster of nodes. The method comprises a plurality of operations. An operation is performed for determining if a second one of the SoC nodes in the cluster has data stored thereon corresponding to a data identifier in response to receiving a data retrieval request including the data identifier. An operation is performed for determining if a remote memory access channel exists between the SoC node and the second one of the SoC nodes. An operation is performed for accessing the data from the second one of the SoC nodes using the remote memory access channel after determining that the second one of the SoC nodes has the data stored thereon and that the remote memory access channel exists between the SoC node and the second one of the SoC nodes. The operations can be performed by one or more processors that access, from memory allocated or otherwise accessible to the one or more processors, instructions that embody the operations and that are processible by the one or more processors.

In another embodiment, a non-transitory computer-readable medium has tangibly embodied thereon and accessible therefrom a set of instructions interpretable by one or more data processing devices of a first SoC node in a cluster of SoC nodes. The set of instructions is configured for causing the one or more data processing devices to implement operations for determining if a second SoC node in the cluster has data stored thereon corresponding to a data identifier, determining if a remote memory access channel exists between the first SoC node and the second SoC node, and accessing the data from the second SoC node using the remote memory access channel after determining that the second SoC node has data stored thereon and that the remote memory access channel exists between the first and second SoC nodes.

In another embodiment, a data processing system comprises a first server on a chip (SoC) node characterized by a SoC node density configuration enabling the second SoC node to serve in a role of providing information computing resources to one or more data processing systems and a second SoC node characterized by a memory configuration enabling the second SoC node to serve in a role of enabling memory resources thereof to be allocated to one or more other SoC nodes. The first SoC node is coupled to the second SoC node by a remote memory access channel. One or more processors of the first SoC node are configured for accessing and processing instructions for causing the first SoC node to determine if the second SoC node has data stored thereon corresponding to a data identifier received by the first SoC node from a particular one of the one or more data processing systems. One or more processors of the second SoC node are configured for accessing and processing instructions for causing the second SoC node to provide the data stored thereon to the first SoC node using the respective remote memory access channel.

These and other objects, embodiments, advantages and/or distinctions of the present invention will become readily apparent upon further review of the following specification, associated drawings and appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high level diagram of a topology for a network system;

FIG. 2 is a block diagram of a network node in accordance with one embodiment of the present invention;

FIG. 3 is a block diagram of a network node in accordance with a second embodiment of the present invention;

FIG. 4 is a diagram showing local to remote address mapping;

FIG. 5 is a diagram showing validation of a remote transaction;

FIG. 6 is a schematic depicting an I/O physicalization; and

FIG. 7 is a schematic showing high performance distributed shared storage.

FIG. 8A is a diagram showing a node cluster architecture in accordance with one embodiment of the present invention.

FIG. 8B is a diagram showing a memory controller node chassis in accordance with one embodiment of the present invention.

FIG. 8C is a diagram showing a rack with a plurality of compute node chassis utilized in a rack with the memory controller node chassis of FIG. 8B in accordance with one embodiment of the present invention.

FIG. 9 is a diagram showing a memory hierarchy structure for computer nodes in accordance with one embodiment of the present invention.

FIG. 10 is a diagram showing a functional block diagram configured for implementing remote memory access functionality in accordance with one embodiment of the present invention.

FIG. 11 is a diagram showing physical address space of a particular one of the compute nodes shown in FIG. 8.

FIG. 12 is a diagram showing an embodiment of the present invention configured for providing memcached server functionality.

FIG. 13A is a diagram showing an embodiment of the present invention configured for implementing memory storage functionality using Partitioned Global Address Space (PGAS) languages.

FIG. 13B is a diagram showing a global memory space that is partitioned between participating threads for pooled memory functionality using PGAS languages.

FIG. 14A is a diagram showing an embodiment of the present invention configured for implementing hybrid memory cube (HMC) deployed near memory pools.

FIG. 14B is a diagram showing a private HMC of compute nodes coupled to a HMC deployed near memory pool.

FIG. 15 illustrates a logical view of a system on a chip (SoC).

FIG. 16 illustrates a software view of a power management unit.

DETAILED DESCRIPTION

FIG. 1 shows an example of a high-level topology of a network system 100 that illustrates compute nodes connected by a switched interconnect fabric. Network ports 101 a and 101 b come from the top of the fabric to external network connectivity. These network ports are typically Ethernet, but other types of networking including InfiniBand are possible. Hybrid nodes 102 a-n are compute nodes that comprise both computational processors as well as a fabric packet switch. The hybrid nodes 102 a-n have multiple interconnect links to comprise the distributed fabric interconnect (i.e., a node interconnect fabric that provides an inter-node communication channel between a plurality of SoC nodes).

A recommended implementation for the fabric interconnect is a high-speed SerDes interconnect, such as multi-lane XAUI. In the preferred solution, a four-lane XAUI interconnect is used. The speed of each of the four lanes can also be varied among 1 Gb/sec (SGMII), XAUI rate (3.125 Gb/sec), and double XAUI rate (6.25 Gb/sec). The actual number of lanes and variability of speeds of each lane are implementation specific, and not important to the described innovations. Other interconnect technologies can be used that have a means to adaptively change the effective bandwidth, by varying some combination of link speeds and widths. Power consumption of a link is usually related to the delivered bandwidth of the link. By reducing the delivered bandwidth of the link, either through link speed or width, the power consumption of the link can be reduced.
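As a rough illustration of trading link width and speed against bandwidth, the following Python sketch enumerates the lane counts and per-lane rates named above and picks the configuration with the least raw bandwidth that still meets a demand. It is only a sketch under stated assumptions: the function names are hypothetical, the rates are treated as raw line rates with encoding overhead ignored, and raw bandwidth is used as a stand-in for power because the text says only that the two are usually related.

```python
from itertools import product

# Per-lane rates named in the text, in Gb/sec (raw line rate; encoding
# overhead is ignored in this sketch).
LANE_RATES_GBPS = (1.0, 3.125, 6.25)
MAX_LANES = 4  # the preferred solution uses a four-lane interconnect


def link_configs():
    """Return every (lanes, rate, raw_bandwidth) combination."""
    return [
        (lanes, rate, lanes * rate)
        for lanes, rate in product(range(1, MAX_LANES + 1), LANE_RATES_GBPS)
    ]


def smallest_sufficient(demand_gbps):
    """Pick the configuration with the least raw bandwidth >= demand.

    Assumption: lower delivered bandwidth implies lower link power, so the
    least sufficient bandwidth is used as a proxy for the lowest power.
    Ties are broken in favor of fewer active lanes.
    """
    candidates = [c for c in link_configs() if c[2] >= demand_gbps]
    if not candidates:
        return None  # demand exceeds what the link can deliver
    return min(candidates, key=lambda c: (c[2], c[0]))


if __name__ == "__main__":
    print(smallest_sufficient(10.0))
```

A real implementation would rank configurations with a measured power model rather than with raw bandwidth.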

Related application Ser. No. 12/794,996 (incorporated by reference) describes the architecture of a power-optimized, high performance, scalable inter-processor communication fabric. FIG. 1 shows a high-level topology 100 of a network system, such as described in the '996 Related Application, that illustrates XAUI connected SoC nodes connected by the switching fabric. The 10 Gb Ethernet ports Eth0 101 a and Eth1 101 b come from the top of the tree. Most, if not all of the hybrid nodes 102 a-n comprise both computational processors as well as an embedded switch as described below in conjunction with FIGS. 2-3. The hybrid nodes 102 a-n have five XAUI links connected to the internal switch. The switching layers use all five XAUI links for switching. For example, as shown in FIG. 1, level 0 leaf nodes 102 d, e (i.e., N0n nodes, or Nxy, where x=level and y=item number) only use one XAUI link to attach to the interconnect, leaving four high-speed ports that can be used as XAUI, 10 Gb Ethernet, PCIe, SATA, etc., for attachment to I/O. The vast majority of trees and fat tree-type network systems have active nodes only as leaf nodes, and the other nodes are pure switching nodes. This approach makes routing much more straightforward. Network system 100 has the flexibility to permit every hybrid node 102 a-n to be a combination computational and switch node, or just a switch node. Most tree-type implementations have I/O on the leaf nodes, but system 100 lets the I/O be on any node. In general, placing the Ethernet at the top of the tree as at 101 a/101 b minimizes the average number of hops to the Ethernet.

In a preferred example, the hybrid nodes 102 a-n shown in the tree-oriented topology of system 100 in FIG. 1 may represent independent nodes within a computing cluster. FIG. 1 illustrates one example implementation of individual nodes 102 a-n of the cluster. In a conventional implementation of a topology such as that of FIG. 1, computing nodes are usually found in the lower level leaf nodes (e.g. N00-N017), and the upper level nodes do not have computing elements but are just network switching elements (N20-N31).

FIG. 2 illustrates one example of a “personality module” 200 in accordance with the present invention which is specifically designed for Ethernet protocol. Such an Ethernet personality module 200 can be used as a hybrid node for one or more of the nodes 102 a-n of FIG. 1. With the node architecture shown in FIG. 2, the CPU Cores 206 of each personality module may be optionally enabled, or could be just left powered-off. With a personality module 200 used for the upper level switching nodes (N20-N30) in FIG. 1, the modules can be operated as pure switching elements (like traditional implementations), or the CPU Cores module 206 can be enabled and used as complete compute nodes within the computing cluster.

Note that the tree oriented interconnect fabric of FIG. 1 is simply one example of a type of server interconnect fabric. The concepts and inventions described herein have no dependency on the specific topology of interconnect fabric or protocol employed.

In more detail, the personality module 200 of FIG. 2 may be used as one or more of the hybrid nodes in the network system of FIG. 1. In FIG. 2, processors 205/206 communicate with the Ethernet MAC controllers 202 via the internal SOC processor bus fabric 201. Ethernet MAC controllers 202 generate Ethernet frames. The Ethernet Bridges 203 prepend a fabric routing header to the beginning of the Ethernet Frame. The Ethernet Bridges 203 contain the layer 2 Ethernet processing and compute the routing header based upon a distributed layer 2 Ethernet switch. A skilled person will appreciate that processors utilized in embodiments of the present invention (e.g., processors 205/206) are not unnecessarily limited to any particular model or brand of processor.

The Ethernet Bridges 203 in FIG. 2 receive an Ethernet frame from the Ethernet MAC controllers 202 in FIG. 2 and send an augmented routing frame to the fabric switch 204. Note that all frames that are flowing within the fabric are routing frames, not Ethernet frames. The Ethernet frame/routing frame conversion is done only as the packet is entering or leaving the fabric via a MAC. Note also that the routing logic within the switch may change fields within the routing frame. The Ethernet frame is never modified (except the adding/removing of the preamble, start of frame, and inter-frame gap fields).

The routing frame is composed of several fields providing sufficient data for the fabric switch 204 of FIG. 2 to make routing and security decisions without inspection of the underlying Ethernet frame, which is considered an opaque payload. The resulting routing frame is thus a concatenation of the routing frame header and the payload frame.

Related application Ser. No. 12/794,996 (incorporated by reference) disclosed in more detail an Ethernet protocol focused fabric switch. In the related '996 application two primary components are described:

- An Ethernet Routing Header processor that inspects Ethernet frames, and adds/removes the fabric switch routing header.
- The fabric switch that is responsible for transporting the packet between nodes by only using data from the routing header.

A key attribute of the Fabric Switch, 204 in FIG. 2, is that packets may be securely routed to their destination node/port by only using data in the routing header, without any inspection of the underlying data payload. Thus the data payload is considered opaque and invariant.

FIG. 3 illustrates a preferred embodiment of a multi-protocol personality module 300 that is similar to the Ethernet protocol module of FIG. 2. The module of FIG. 3 is similar to the Ethernet fabric module of FIG. 2 in that it continues to be responsible for transporting packets between nodes by only using data from the routing header. However, the multi-protocol personality module 300 of FIG. 3 operates with multiple protocols to accommodate a network operating with different protocols. Protocol specific personality modules are added such that routing header processing is done in new and separate fabric personality modules that provide mappings from specific protocol semantics to fabric routing headers. The multi-protocol personality module 300 of FIG. 3, like the Ethernet module of FIG. 2, is responsible for adding a routing header for packets entering the fabric, and removing the routing header when packets are leaving the fabric. The routing header remains in place as the packets are transported node to node across the fabric.

The multi-protocol personality module 300 of FIG. 3 includes a portion for processing Ethernet (302, 304) which functions much like the module of FIG. 2, and a portion (e.g., components 303, 305, 306, 307) for allowing bus transactions to be transported across the fabric, offering the ability to remote memory, I/O, and interrupt transactions across the fabric. In some embodiments of the present invention, a Remote Bus Personality Module of the multi-protocol personality module 300 comprises the portion of the multi-protocol personality module 300 that allows bus transactions to be transported across the fabric thereby enabling the ability to remote memory, I/O, and interrupt transactions across the fabric. In this regard, the Remote Bus Personality Module enables functionality related to allowing bus transactions to be transported across the fabric and thereby provides the ability to remote memory, I/O, and interrupt transactions across the fabric.

As can be seen from the block diagram of FIG. 3 depicting an exemplary multi-protocol module 300, the Fabric Switch 308 transports packets across nodes of inter-node fabric (i.e., an inter-node communication channel defined thereby) therebetween by inspection of only the routing header. The routing header is composed of several fields providing sufficient data for the fabric switch 308 to make routing and security decisions without inspection of the underlying opaque data payload. The resulting routing frame is thus a concatenation of the routing frame header and the opaque payload frame. One example of a payload frame is an Ethernet frame. For example, as shown in Table 1 below, a routing frame might comprise:

TABLE 1

Routing Frame Header | Ethernet Frame Packet
RF Header | MAC destination | MAC Source | Ethertype/Length | Payload (data and padding) | CRC32

An example of a routing header follows in Table 2, but the fields may vary by implementation:

TABLE 2

Field | Width (Bits) | Notes
Domain ID | 5 | Domain ID associated with this packet. 0 indicates that no domain has been specified.
Mgmt Domain | 1 | Specifies that the packet is allowed on the private management domain.
Source Node | 12 | Source node ID
Source Port | 2 | 0 = MAC0, 1 = MAC1, 2 = MAC_management processor, 3 = MAC OUT
Dest Node | 12 | Destination node ID
Dest Port | 2 | 0 = MAC0, 1 = MAC1, 2 = MAC_management processor, 3 = MAC OUT
RF Type | 2 | Routing Frame Type (0 = Unicast, 1 = Multicast, 2 = Neighbor Multicast, 3 = Link Directed)
TTL | 6 | Time to Live - # of hops that this frame has existed. Switch will drop packet if the TTL threshold is exceeded (and notify management processor of exception).
Broadcast ID | 5 | Broadcast ID for this source node for this broadcast packet.
Checksum |  | Checksum of the frame header fields.
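To make the field widths of Table 2 concrete, the following Python sketch packs the listed fields into a single integer and unpacks them again. It is a minimal sketch under stated assumptions, not the actual header layout: the bit order (fields in table order, most significant first) and the names `pack_header` and `unpack_header` are hypothetical, and the Checksum field is left out because Table 2 does not give its width.

```python
# (field name, width in bits) in the order listed in Table 2.
# Assumption: fields are laid out in table order, most significant first.
FIELDS = (
    ("domain_id", 5),
    ("mgmt_domain", 1),
    ("source_node", 12),
    ("source_port", 2),
    ("dest_node", 12),
    ("dest_port", 2),
    ("rf_type", 2),
    ("ttl", 6),
    ("broadcast_id", 5),
)

# Total width of the fields above; the checksum is excluded because its
# width is not given in the table.
HEADER_BITS = sum(width for _, width in FIELDS)


def pack_header(**values):
    """Pack named field values into one integer; missing fields are 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_header(word):
    """Recover the field values from an integer built by pack_header."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


if __name__ == "__main__":
    header = pack_header(domain_id=3, source_node=30, dest_node=14, ttl=1)
    print(HEADER_BITS, unpack_header(header))
```

Because a switch in this scheme inspects only these fields, a routing decision could read `dest_node` and `dest_port` from the unpacked header without touching the payload.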

Since the Fabric Switch 308 makes routing decisions by inspection of only the routing header, and the data payload frame is considered both opaque and invariant, these characteristics can be leveraged to create an extensible set of personality modules. A multi-protocol personality module 300 such as shown in FIG. 3 provides a mapping from specific protocols to add and remove the fabric routing headers for that protocol.

When using a personality module 300 such as shown in FIG. 3 as a hybrid node 102 a-n in the system of FIG. 1, as previously stated, all frames that are flowing within the fabric are routing frames, not Ethernet frames. The payload frame/routing frame conversion is done only as the packet is entering or leaving the fabric. Note also that the routing logic within the switch may change fields within the routing frame. The payload frame is never modified.

The Ethernet Bridge personality processor 304 in FIG. 3 is functionally identical to the Routing Header processor in Related application Ser. No. 12/794,996, but generalized from a single-protocol processor (such as FIG. 2), to a module having a number of protocol processing portions. The Ethernet Bridge Processor 304 adds the routing header as the packet comes from the Ethernet MAC 302 to the fabric switch 308, and removes the routing header as the packet comes from the fabric switch 308 to the MAC 302.

Similar to FIG. 2, the processors 312/314 communicate with the Ethernet MAC controllers 302 in FIG. 3 via the internal SOC processor bus fabric 301. Ethernet MAC controllers 302 generate Ethernet frames. The Ethernet Bridge 304 prepends a fabric routing header to the beginning of the Ethernet Frame. The Ethernet Bridge 304 contains the layer 2 Ethernet processing and computes the routing header based upon a distributed layer 2 Ethernet switch.

As disclosed above in reference to the multi-protocol personality module 300 of FIG. 3, the Remote Bus Personality Module includes the Remote Interrupt Manager 303, the Remote Address translation module 305, the Bus Bridge 306 and the Remote Bus Processor 307. In FIG. 3, the Bus Fabric 301 represents the internal bus fabric of a system on a chip (SOC). As discussed below, the SoC can be configured to provide server functionality and thus be referred to as a server on a chip. This bus fabric carries CPU mastered load/store transactions to memory and I/O, as well as I/O mastered transactions, e.g. initiated by I/O DMA controllers.

The functionality of the Remote Bus Personality Module consists of:

- The Remote Address translation module 305, which converts local addresses steered to the Remote Bus Personality Module (RBPM) to [Remote Node, Remote Node Address].
- The Bus Bridge 306, which converts a processor bus of arbitrary address and data width into a packed, potentially multi-flit packet. In this regard, the Bus Bridge 306 converts a processor bus of arbitrary address and data width into packetized transfers across the fabric.
- The Remote Bus Processor 307, which adds and removes the fabric routing header, transports bus packets from Bus Bridge 306 and interrupts from Remote Interrupt Manager 303 over the fabric in-order with guaranteed delivery.

The Remote Address translation module 305 converts local addresses steered to the RBPM to [Remote Node, Remote Node Address]. This is depicted in more detail in FIG. 4 which shows that there is a set of mapping tables from [local address, size] to [Node ID, Remote address]. This address translation can be implemented as a custom module, typically leveraging a CAM (Content Addressable Memory). Alternatively, this stage may be implemented with a standard IP block of an I/O MMU (memory management unit) which translates the intermediate physical address in a bus transaction to a physical address. In this case, these translation tables are configured so that the resulting physical address encodes the [Remote Node ID, and Remote Address].
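The mapping from [local address, size] to [Node ID, Remote address] can be pictured with a small software model. The Python sketch below is illustrative only: the `Window` record and `translate` function are hypothetical names, the addresses in the usage lines are made up, and a linear scan stands in for the parallel match that a CAM or I/O MMU would perform in hardware.

```python
from typing import List, NamedTuple, Optional, Tuple


class Window(NamedTuple):
    """One mapping-table entry: a local range mapped to a remote range."""
    local_base: int
    size: int
    node_id: int
    remote_base: int


def translate(windows: List[Window], local_addr: int) -> Optional[Tuple[int, int]]:
    """Return (node_id, remote_addr) for a local address, or None if unmapped."""
    for w in windows:
        if w.local_base <= local_addr < w.local_base + w.size:
            # Preserve the offset of the access within the window.
            return (w.node_id, w.remote_base + (local_addr - w.local_base))
    return None


if __name__ == "__main__":
    # Hypothetical table: a 4 KiB local window mapped onto node 14.
    table = [Window(0x1000_0000, 0x1000, 14, 0x8000_0000)]
    print(translate(table, 0x1000_0010))
```

An address that falls outside every window returns `None`, which corresponds to a transaction that is not steered to a remote node.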

The Bus Bridge 306 of FIG. 3 functions to interface to and packetize the CPU/I/O bus transactions. In this regard, the Bus Bridge 306 can function as a packetizer. This Bus Bridge 306 is conceptually designed as having a layered model. In any given implementation, these layers may or may not be present, and will have tuned functionality for the bus bridging that is being implemented.

The multiple layer design of the Bus Bridge 306 is:

- Transaction layer
  - The Transaction layer performs any necessary transforms that understand multiple bus channels or that understand the semantics of the transaction.
- Transfer layer (also known as Transport layer)
  - The Transfer layer performs any necessary transforms within a channel related to the overall data transfer. This could include data compression.
- Data Link layer
  - The Data Link layer performs arbitration, multiplexing and packing of channels to a physical packet representation.
  - Implements any necessary flow control.
- Physical layer
  - The Physical layer performs transformation and optimization of the physical packet representation to packet size, width, and flit requirements to the fabric switch implementation. This Physical layer and/or the Link layer may actually produce multiple flits corresponding to a single physical bus packet.
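The point that a single physical bus packet may become multiple flits can be sketched as follows. This Python fragment is a simplified illustration under assumptions that are not in the text: flits are fixed-size byte strings, the last flit is zero-padded, and the original packet length is carried separately so that the receiver can trim the padding. The names `to_flits` and `from_flits` are hypothetical.

```python
from typing import List


def to_flits(packet: bytes, flit_bytes: int) -> List[bytes]:
    """Split a packed bus packet into fixed-size flits, zero-padding the last."""
    if flit_bytes <= 0:
        raise ValueError("flit size must be positive")
    return [
        packet[i:i + flit_bytes].ljust(flit_bytes, b"\x00")
        for i in range(0, len(packet), flit_bytes)
    ]


def from_flits(flits: List[bytes], length: int) -> bytes:
    """Reassemble the packet on the receiving side and drop the padding."""
    return b"".join(flits)[:length]


if __name__ == "__main__":
    packet = b"abcdefghij"
    flits = to_flits(packet, 4)
    print(flits, from_flits(flits, len(packet)))
```

The receiving Bus Bridge would perform the inverse step, collecting the flits before reconstituting a bus transaction.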

The Remote Bus Processor 307 functions in a similar manner to the Ethernet Bridge Personality Processor 304 to add and remove the fabric routing header and transport bus packets from 306 to the fabric switch 308. Additionally, the Remote Bus Processor 307 connects interrupts from Remote Interrupt Manager 303 over the fabric with guaranteed delivery.

Example 1: Distributed One-Sided Cache Coherent Shared Memory Across the Fabric

In FIG. 1, one or more of the compute nodes could constitute servers, and the fabric connects two or more servers. The ability to open up memory sharing windows in another server across the fabric enables a wide range of new capabilities that are not possible in traditional “shared nothing” clusters. This example takes the form of a load or store bus transaction issued by Server Node A that targets a physical address in Server Node B. Such bus transactions may originate from any bus master in Node A, including processors, I/O bus masters (such as a SATA controller), or a DMA engine.

FIG. 4 illustrates the first stage of a remote shared memory access transaction using the Remote Bus Personality portion of the module of FIG. 3. As shown in FIG. 4, a bus master on Node A issues a load or store transaction to a range of physical addresses mapped to the Remote Bus Personality portion. The transaction appears as a bus transaction on Bus Fabric 301 of FIG. 3. The SOC busses of Bus Fabric 301, such as an ARM AXI, have configurable address and data widths, for example 40 address bits and 64-128 data bits.

The transaction flows through the Bus Bridge 306, as illustrated in FIG. 3, which packetizes the bus transaction and creates one or more flits optimized for the fabric switch 308. The packetized transaction flows through the Remote Bus Processor 307 to create the routing header for the fabric. The remote bus packets are required to be delivered to destination server B in-order and with guaranteed delivery. If the underlying fabric and fabric switch do not implicitly have these characteristics, then the Remote Bus Processor 307 is required to implement in-order and guaranteed delivery.

The resulting routing frame flows into the fabric switch 308 on Node A, is routed through the intervening fabric (see FIG. 1), which may consist of multiple routing hops, and is delivered to the fabric switch on target Node B. For example, comparing FIG. 1, Node A might be node N30 and target Node B could be represented as node N014. The packet from fabric switch 308 of Node A is identified as a remote bus transaction, and is delivered to the Remote Bus Processor 307 on Node B.

Node B's Remote Bus Processor 307 implements the receiving side of in-order and guaranteed delivery in conjunction with the transmitting side. This can include notification of the sender of errors, missing flits, and request for retransmission. The Remote Bus Processor 307 of Node B then strips the routing header, sending the packetized transaction into the Bus Bridge 306. The Bus Bridge module 306 of Node B unpacks the packetized transaction (which may have included collecting multiple flits), and reconstitutes a valid transaction posted to Node B's bus. Any responses to that bus transaction are seen by this subsystem, and sent back to Node A following the same mechanism.
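One common way to obtain in-order, guaranteed delivery over a fabric that does not provide it is to number packets and have the receiver hold back anything that arrives early. The Python sketch below shows that receiving side only. It is an assumption-laden illustration: the text does not say that sequence numbers are used, and the class name `InOrderReceiver` and its methods are hypothetical.

```python
class InOrderReceiver:
    """Receiving side of a sequence-numbered, in-order delivery scheme.

    Assumption: the sender numbers packets 0, 1, 2, ... and retransmits any
    sequence number that the receiver reports as missing.
    """

    def __init__(self):
        self.expected = 0      # next sequence number to hand to the bus bridge
        self.pending = {}      # packets that arrived ahead of a gap
        self.delivered = []    # payloads released in order

    def receive(self, seq, payload):
        """Accept one packet; return sequence numbers to ask the sender for."""
        if seq < self.expected:
            return []          # duplicate of something already delivered
        self.pending[seq] = payload
        # Release every packet that is now contiguous with what was delivered.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        # Anything between the next expected packet and the highest buffered
        # packet is missing and should be retransmitted.
        highest = max(self.pending, default=self.expected)
        return [s for s in range(self.expected, highest) if s not in self.pending]


if __name__ == "__main__":
    rx = InOrderReceiver()
    rx.receive(0, "load A")
    print(rx.receive(2, "store C"))   # packet 1 has not arrived yet
    rx.receive(1, "store B")
    print(rx.delivered)
```

Only payloads in `delivered` would be passed on to the Bus Bridge, so the bus on the receiving node never sees transactions out of order.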

There are several functional and performance issues related to this cache coherency example. First, to maintain efficiency, coherent memory transactions issued by CPUs in node A will not snoop caches on remote nodes. Second, incoming remote transactions from a Remote Bus Personality section can be implemented as one-sided cache coherent. This means that incoming loads or stores can optionally be configured to snoop and perform coherency protocols against processor caches. Finally, this provides a powerful, easy to use cache coherent programming mode without the performance and availability problems related to a full CC-NUMA (cache-coherent non-uniform memory access) design.

Example 2: Remote Bus Personality Module—Remote Interrupts

In many SOC bus infrastructures, interrupts are individual lines that feed into an interrupt controller for the processor(s) such as the Remote Interrupt Manager 303 of FIG. 3. These individual interrupt lines are sometimes OR'd with each other to map multiple interrupt sources to a single interrupt line.

For example, suppose a processor on server A (such as Node N30 of FIG. 1) generates an interrupt on server B (such as Node N14 of FIG. 1). First, server A writes to a remote CSR (control status register) on server B which maps to the requested interrupt, such as an interrupt line of Interrupt Manager 303 of FIG. 3. The interrupt line is then made active and interrupts the Remote Bus Processor 307 on server B.

As another example, an I/O interrupt on server A can be reflected to an interrupt on server B. An I/O controller on server A (like a SATA controller) raises an interrupt line that is being monitored by the Remote Interrupt Manager 303 of FIG. 3. The Remote Interrupt Manager 303 is woken by the interrupt line that it is monitoring. Remote Interrupt Manager 303 creates a packet tagged as an interrupt packet and sends it into the Remote Bus Processor 307. This interrupt packet flows through the fabric as described above. When the interrupt packet reaches server B, the interrupt packet is delivered to Remote Bus Processor 307, which notes the specially tagged interrupt packet and sends it to the Remote Interrupt Manager 303 of server B. Remote Interrupt Manager 303 causes the specified interrupt line to go active in server B.
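The reflected-interrupt flow above can be modeled in a few lines. In this Python sketch the packet is a plain dictionary and the fabric is a direct method call, both of which are simplifications. The `Node` class, the `"interrupt"` tag value, and the method names are hypothetical and stand in for the Remote Interrupt Manager and Remote Bus Processor roles described in the text.

```python
from dataclasses import dataclass, field

INTERRUPT_TAG = "interrupt"  # hypothetical tag value marking interrupt packets


@dataclass
class Node:
    name: str
    active_lines: set = field(default_factory=set)        # lines raised locally
    bus_transactions: list = field(default_factory=list)  # ordinary bus packets

    def make_interrupt_packet(self, dest: str, line: int) -> dict:
        """Remote Interrupt Manager role: wrap a monitored line in a packet."""
        return {"kind": INTERRUPT_TAG, "src": self.name, "dest": dest, "line": line}

    def deliver(self, packet: dict) -> None:
        """Remote Bus Processor role: route interrupt packets to the
        interrupt manager, which raises the named line; pass everything
        else on as a bus transaction."""
        if packet["kind"] == INTERRUPT_TAG:
            self.active_lines.add(packet["line"])
        else:
            self.bus_transactions.append(packet)


if __name__ == "__main__":
    server_a, server_b = Node("A"), Node("B")
    # An I/O controller on server A raises line 5; it is reflected to server B.
    server_b.deliver(server_a.make_interrupt_packet("B", 5))
    print(server_b.active_lines)
```

In a fuller model the packet would also pass through the routing header and delivery stages described earlier before reaching `deliver`.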

Example 3: Remote Address Translation and Security

Referring to FIG. 3, block 314 is a management CPU core (see also Mgmt Core 205 of FIG. 2). This management CPU 314 is a key part of maintaining fabric security for remote bus transactions. The management CPU 314 maintains multi-node fabric transaction security on both sides of the transaction.

Each Remote Bus Processor 307 is allocated a range of addresses in physical address space. An exemplary process for the secure mapping of an address range from Server B into Server A's address space is as follows.

1. Main OS processor on Server A (block 312 in FIG. 3) sends a mapping request of tuple (node #, physical address in node #'s address space, and window length) to local management processor.
2. Management CPU 314 on Server A has the ability to accept or deny the remote mapping request. Upon local acceptance, management CPU on server A sends a secure management request with the remote mapping request to management CPU 314 on server B.
3. Management CPU 314 on server B has the ability to accept or deny the remote mapping request from Server A.
4. Upon acceptance, management CPU 314 on server B installs a mapping into the I/O MMU on server B, mapping an IPA window to the requested physical address. Additionally the Remote Bus Processor 307 on server B installs a mapping that designates that remote node A has access to that window.
   - Mappings can be granted as read-only, write-only, or read-write.
   - These mappings are illustrated in FIG. 5.
   - These mappings can be implemented using a standard IP block like an I/O MMU, or with custom logic typically using a CAM.
5. Management CPU 314 on server B returns the base intermediate physical address of the window.
6. Management CPU 314 on server A installs a mapping into the local I/O MMU mapping from an IPA window on server A to the server B IPA window base address.
7. Management CPU 314 on server A returns the allocated local IPA address for the requested window to the requesting client on the main OS processor 312.
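The seven-step exchange above can be walked through in a small software model. The Python sketch below is a simplified illustration, not the described hardware: `ManagementCPU`, `request_mapping`, the policy callbacks, the dictionary-based I/O MMU tables, and the starting IPA value are all hypothetical, and the secure management channel between the two management CPUs is reduced to a function call.

```python
class ManagementCPU:
    """Toy model of a node's management CPU and the tables it controls."""

    def __init__(self, name, policy):
        self.name = name
        self.policy = policy        # callable(request) -> bool (accept/deny)
        self.iommu = {}             # IPA window base -> mapping target
        self.grants = {}            # IPA window base -> node allowed to use it
        self._next_ipa = 0x4000_0000  # hypothetical start of the IPA pool

    def alloc_ipa(self, length):
        """Hand out the next free IPA window of the requested length."""
        base = self._next_ipa
        self._next_ipa += length
        return base


def request_mapping(local, remote, phys_addr, length, access="rw"):
    """Map [phys_addr, phys_addr + length) on `remote` into `local`.

    Returns the local IPA base on success, or None if either side denies.
    """
    request = (remote.name, phys_addr, length)        # step 1: the tuple
    if not local.policy(request):                     # step 2: local accept/deny
        return None
    if not remote.policy(request):                    # step 3: remote accept/deny
        return None
    remote_ipa = remote.alloc_ipa(length)             # step 4: remote I/O MMU entry
    remote.iommu[remote_ipa] = (phys_addr, length, access)
    remote.grants[remote_ipa] = local.name            # step 4: access grant
    # step 5: remote returns remote_ipa to the local management CPU
    local_ipa = local.alloc_ipa(length)               # step 6: local I/O MMU entry
    local.iommu[local_ipa] = (remote.name, remote_ipa, length, access)
    return local_ipa                                  # step 7: back to the OS


if __name__ == "__main__":
    server_a = ManagementCPU("A", policy=lambda req: True)
    server_b = ManagementCPU("B", policy=lambda req: req[2] <= 0x1000)
    print(hex(request_mapping(server_a, server_b, 0x8000_0000, 0x1000)))
```

Keeping the accept/deny decision in a per-node policy callback mirrors the point that each management CPU independently controls what may be mapped on its side.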

In the described examples, DMA engines on both the local (server A) and remote (server B) sides can be used to hardware facilitate data movement in either direction. Users are not constrained to the classic push or pull data movement model. Further, many SoC bus transaction models have some notion of trust or security zone associated with each bus transaction. As an example, ARM AXI has the notion of TrustZone, where transactions are marked as being in Trusted World or Normal World. The Remote Bus portion in the Personality Module 300 illustrated in FIG. 3 annotates the bus transaction packet with the trust or security zone of the incoming bus transaction. When the remote server (e.g., server B) is issuing the remote transaction into the local bus fabric, a configuration option is used to define whether the transactions are issued with the security zone of the requesting processor or at a specific security zone level.

Example 4: Remote Bus Personality Module I/O Physicalization

FIG. 6 illustrates using the Remote Bus Personality portion of the Module 300 of FIG. 3 (i.e., the Remote Bus Personality Module) for I/O Physicalization. Some data center customers desire to have compute servers that have no embedded storage or I/O within the server, and then separate I/O boxes/chassis within the data center rack. The Remote Bus Personality Module of FIG. 3 allows multiple servers, designated as Srvr A, B, and C in FIG. 6, to use unmodified device drivers within the operating systems running in servers A, B, and C to access physically remote I/O devices across the server fabric. The server operating system, device drivers, and applications believe that they are communicating with server local devices. Use of the Remote Bus Personality Module of FIG. 3 allows the device I/O and interrupts to the actual I/O device to be bi-directionally remoted across the fabric with no changes or visibility to software.

Device drivers running on CPUs in the Server boxes (A, B, C) of FIG. 6 access I/O registers transparently across Fabric 608 in the remoted peripheral controller cards, illustrated as remote PCIe controllers 610/612 and remote SATA controllers 614/616 in FIG. 6. Direct memory access ("DMA") engines are located either in the server boxes, or alternatively in the I/O boxes embedded within the peripheral controllers, and the DMA traffic is remoted bi-directionally and transparently across Fabric 608. Additionally, interrupts generated by the remote peripheral controllers are transparently transmitted across Fabric 608 and presented to the processors in servers A, B, or C. In this manner, the Remote Bus Personality Module enables remote memory access functionality which includes the ability to allow memory capacity to be provisioned based on workload on a per-node basis, to load/store from remote memory, to perform remote DMA transactions, and to perform remote interrupts.

The address maps, both I/O and memory, and interrupt maps are maintained and transmitted transparently across Fabric 608. In this example, the data flow is completely optimized. An example storage block transfer from SATA controller 614/616 of FIG. 6 would typically become:

- The device driver on Srvr B is reading a block from remote SATA 614 connected SSD 620 to a pre-allocated block buffer on a physical address PA1.
- The device driver programs and initiates the read by writing the appropriate control registers in remote SATA controller 614.
- Remote SATA controller 614 contains an embedded DMA engine which initiates the DMA, reading the data from the remoted SSD, and landing the data directly into physical address PA1 in Srvr B's address space.
- No network communication or additional data copies were needed in this optimized transfer.

Example 5: Remote Bus Personality Module Enabling High Performance Distributed Shared Storage

FIG. 7 illustrates an alternate distributed storage example. Distributed storage functionality is an embodiment of remote memory access functionality in which the remote memory is non-volatile memory (i.e., storage type memory). In this case the computational servers are illustrated as Srvr A, B, C. The I/O server boxes containing the storage peripherals in this use case have processors as well. This high performance shared storage example has one additional data movement compared with example 4, I/O physicalization. But this example 5 adds the capability that the I/O devices and controllers can be shared by multiple servers.

In FIG. 7, a method of storage block transfer from a SATA controller is as follows.

- The device driver on Srvr A is reading a block from remote SATA 714 connected SSD 716 to a pre-allocated block buffer on a physical address PA1.
- The read is initiated by sending a lightweight message across Fabric 708 from Srvr A to Target I/O server 720 that contains the description of the read (device, block, size) and the physical address in Srvr A that the data should be moved to.
- The driver on SATA device 714 on Target I/O server 720 initiates the DMA read to its local buffer from its local SATA controller.
- Upon the completion of the DMA transfer to the I/O server's buffer, the device driver on the I/O server 720 uses a local DMA engine to initiate a fabric remoted DMA transfer from its local buffer to the physical address of the buffer in the requesting server's address space.
- The device driver programs and initiates the read by writing the appropriate control registers in the controller of remote SATA 714.

This example requires one additional data movement as compared to the I/O Physicalization example 4, but is far more efficient than a traditional network oriented SAN or NAS remote storage data movement.

The discussion now turns to disassociation of memory (e.g., preferably mutable memory) from a cluster of nodes while giving those nodes the ability to perform full load/store/barrier instructions to a memory pool (e.g., aggregation of memory resources provided at a centralized location) through allocation of memory of the memory pool to the nodes based on workload on a per-node basis. Such implementation is referred to herein as pooled memory functionality. Implementing pooled memory functionality in this manner supports allocation of memory privately on a per node basis and allocation of memory to all or a portion of the nodes in a non-coherent, shared manner. Furthermore, in view of the disclosures made herein, a skilled person will appreciate that remote memory access functionality in accordance with the present invention supports implementation of near shared memory using, for example, HMC (hybrid memory cube) memory resources and supports implementation of far shared memory over a SoC node fabric using, for example, both HMC and DDR memory resources.

A node cluster architecture 800 is shown in FIG. 8A. The node cluster architecture 800 is configured for providing remote memory access functionality in accordance with the present invention. More specifically, the node cluster architecture 800 includes a plurality of compute nodes 805 and a plurality of memory controller nodes 810 that are connected via a fabric 815 (i.e., links extending between fabric switches of interconnected nodes). Each one of the memory controller nodes 810 has memory 820 coupled thereto. Jointly, the memory 820 attached to all or a portion of the memory controller nodes 810 is referred to herein as pooled memory. Preferably, aside from resident memory provisioning, the underlying architecture of the compute nodes 805 and the memory controller nodes 810 is entirely or substantially the same.

A plurality of the compute nodes 805 can be provided on a single card (i.e., a compute node card) and a plurality of the memory controller nodes 810 can be provided on a single card (i.e., a memory controller node card). The compute node card and memory controller node card can have identical overall planar dimensions such that both types of cards have a common or identical planar form factor. Each compute node 805 and each memory controller node 810 can have a plurality of SoC units thereon that provide information processing functionality. By definition, a compute node card will be populated more densely with SoC units than will a memory controller node card. Preferably, but not necessarily, an architecture of the SoC units of the compute node cards is substantially the same or identical to that of the memory controller node cards.

The compute nodes 805 are each provisioned (i.e., configured) with a limited amount of local memory 807 and are packaged together (i.e., integrated with each other) with the goal of optimizing compute density within a given form factor (i.e., maximizing compute density in regard to cost, performance, space, heat generation, power consumption and the like). The memory controller nodes 810 are provisioned with a relatively large amount of local memory and together provide the pooled memory resource at a chassis, rack or cluster level (i.e., to maximize pooled memory in regard to cost, performance, space, heat generation, power consumption and the like for a given form factor). Put differently, a compute node card has insufficient memory resources for enabling intended data computing performance (e.g., data processing throughput) of compute nodes thereof and a memory controller node card has insufficient node CPU resources for enabling intended data computing performance (e.g., put/get and/or load/store utilization) of the pooled memory thereof. In this regard, intended data computing functionality of the server apparatus requires that the server apparatus include at least one compute node card and at least one memory controller node card.

Each compute node 805 can be allocated a portion of the pooled memory 820, which then serves as allocated memory to that particular one of the compute nodes 805. In this regard, the pooled memory 820 can be selectively allocated to and be selectively accessed by each one of the nodes (i.e., via pooled memory functionality). As shown in FIG. 8B, the one or more memory controller nodes 810 and associated pooled memory 820 (e.g., DDR as shown or HMC) can be implemented in the form of a memory controller node chassis 821. As shown in FIG. 8C, the memory controller node chassis 821 can be utilized in a rack 822 with a plurality of compute node chassis 823 that share memory resources of the memory controller node chassis 821. In this regard, one or more compute nodes 805 (or cards comprising same) and one or more memory controller nodes 810 with associated pooled memory 820 (or cards comprising same) can be referred to as a pooled memory server apparatus. It is also disclosed herein that a pooled memory server apparatus configured in accordance with the present invention can include a storage controller node chassis that is similar to the memory controller node chassis except with storage resources (e.g., non-volatile storage resources such as hard disk drives) as opposed to memory resources (e.g., RAM).

In view of the disclosures made herein, a skilled person will appreciate that an underlying goal of the node cluster architecture 800 is to provide a fabric attached pool of memory (i.e., pooled memory) that can be flexibly assigned to compute nodes. For example, in the case of a dense node board such as that offered by Calxeda Inc under the trademark EnergyCard, every node of the compute node card (i.e., a plurality of nodes on a single board substrate) has a constrained, small number of DIMMs (e.g., 1) and therefore a relatively constrained amount of DRAM (e.g., every compute node having something like 4-8 GB of DRAM). But, in practical system implementations, some nodes will need different memory provisioning for specific requirements thereof (e.g., for Hadoop NameNode functionality, for Memcache functionality, for database functionality).

Pooled memory in accordance with embodiments of the present invention, which is attached to compute nodes through a fabric (i.e., fabric memory pools), supports standardized dense node cards such as the Calxeda brand EnergyCard but allows them to be memory provisioned differently. In one specific implementation (shown in FIG. 8A), the bulk of the node cards in a cluster are cards with compute nodes (i.e., compute node cards). These compute node cards are configured with memory that is optimized with respect to capacity, power, and cost (e.g., one DIMM per channel). A variant of the compute node cards is cards that are configured with associated pooled memory (i.e., pooled memory cards). The pooled memory cards, which are memory controller node cards in combination with associated pooled memory thereon, can be configured as maximum DRAM capacity cards. For example, the pooled memory cards can utilize multiple DIMMs per channel, RDIMMs at high densities (and higher power) or the like. This additional DRAM power is amortized across the fabric because there are likely a relatively small number of these pooled memory cards in comparison to compute node cards.

Embodiments of the present invention allow for pooled memory cards to be physically provisioned in a variety of different configurations. In support of these various physical provisioning configurations, pooled memory cards can be provisioned based on DIMM density (e.g., maximized DIMM density) or can be provisioned based on DRAM capacity (e.g., maximized DRAM capacity). In regard to physical placement of the pooled memory cards, various rack and chassis positions are envisioned. In one implementation (i.e., chassis provisioning), all or a portion of the pooled memory cards are configured for maximum DRAM capacity and serve as a chassis fabric memory pool. In another implementation (i.e., rack provisioning), a memory appliance (1U or 2U) is fabric connected within the rack using pooled memory cards that are configured for maximum DRAM capacity. In another implementation (i.e., end of row provisioning), an entire rack is provided with pooled memory cards and serves as a memory rack that is at the end of a row of racks with compute nodes (i.e., compute racks). In still another implementation (i.e., distributed provisioning), all pooled memory cards are configured for maximum DRAM capacity and Linux NUMA APIs are used to create a distributed far memory pool. Additionally, Linux can even round-robin pages across the NUMA memory pool.
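Round-robin placement of pages across a distributed far memory pool can be illustrated with simple modular arithmetic. The following is a minimal sketch under stated assumptions: the 4 KiB page size, the function names, and the node labels are hypothetical, and the code models only the placement rule, not the Linux NUMA subsystem itself.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def interleaved_node(page_number, pool_nodes):
    """Pick the pool node that backs a page under round-robin interleaving."""
    return pool_nodes[page_number % len(pool_nodes)]


def node_for_address(addr, pool_nodes, page_size=PAGE_SIZE):
    """Map a byte address within an interleaved region to its backing pool node."""
    return interleaved_node(addr // page_size, pool_nodes)
```

With three pool nodes, consecutive pages land on the first, second, and third node in turn and then wrap around, which spreads bandwidth demand across the pool rather than concentrating it on one memory controller node.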

FIG. 9 shows a memory hierarchy structure 900 of each one of the compute nodes 805. As shown, the memory hierarchy structure 900 of each one of the compute nodes 805 has various memory resources. Of particular interest to remote memory access functionality implemented in accordance with the present invention is Remote Memory Layer 905, which introduces an additional level into the memory hierarchy structure 900 of each compute node. The Remote Memory Layer 905 enables a SoC (i.e., system) architecture where memory resources can be pooled at the cluster level and allocated amongst the nodes in a cluster (i.e., a plurality of nodes interconnected by a fabric). The Remote Memory Layer 905 allows memory capacity per node to be changed based on workload needs by changing the amount of pooled memory that is provisioned per node. This disaggregation and pooling of memory resources at the cluster level provides for better overall memory capacity utilization and lower power. Furthermore, the Remote Memory Layer 905 supports two types of accesses to remote memory that is mapped into a node's physical address space: (a) coarse-grain accesses that rely on virtual-memory paging and involve transferring pages between remote and local memories, and (b) fine-grain accesses that trigger cacheline transfers from the remote memory as a result of loads/stores from a node's operating system CPU to remote memory.

FIG. 10 shows a functional block diagram 1000 configured for implementing remote memory access functionality. The functional block diagram 1000 supports access to remote memory by a compute node 1005 (i.e., one of a plurality of compute nodes) across a fabric 1010. A Messaging Personality Module (i.e., the Messaging PM 1015) of the compute node 1005 serves as a hardware interface to remote DRAM 1020 (i.e., remote memory). The Messaging PM 1015 is connected to a cache coherent interconnect 1025 of the compute node 1005 such as through AXI Master and Slave interfaces thereof. The cache coherent interconnect 1025 has direct access to internal SRAM 1026 of the compute node 1005. Local DRAM 1127 (e.g., on a card level substrate on which the node is mounted) is coupled to the cache coherent interconnect 1025 via one or more memory controllers 129. Remote memory addresses of the remote DRAM 1020 are mapped to the Messaging PM 1015 through the AXI Master Port on the cache coherent interconnect 1025. Loads and stores to the remote DRAM 1020 by management cores 1030 (i.e., management processors) and operating system cores 1035 (i.e., OS processors) are diverted to the Messaging PM 1015, which then encapsulates these accesses in fabric packets and transports them to a memory controller node 1040 that serves the remote DRAM 1020 (i.e., the receiving controller node 1040). The memory controller node 1040 that serves the remote DRAM 1020 includes an instance of the Messaging PM. The Messaging PM of the receiving memory controller node 1040 performs the requested access by reading or writing the local memory (e.g., local DRAM) of the memory controller node through its cache coherent interconnect.
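The encapsulation of loads and stores into fabric packets can be sketched as follows. The packet layout used here (a 1-byte opcode, 2-byte destination node, 8-byte address, and 2-byte length) is purely an assumption made for illustration, as are the function names; the source does not specify a packet format.

```python
import struct

# Assumed header layout: opcode (1 B), destination node (2 B), address (8 B), length (2 B)
HEADER = struct.Struct("!BHQH")
OP_LOAD, OP_STORE = 0, 1


def encapsulate_store(dest_node, addr, data):
    """Sending side: wrap a store and its data in a hypothetical fabric packet."""
    return HEADER.pack(OP_STORE, dest_node, addr, len(data)) + data


def encapsulate_load(dest_node, addr, length):
    """Sending side: wrap a load request; no payload follows the header."""
    return HEADER.pack(OP_LOAD, dest_node, addr, length)


def serve(packet, dram):
    """Receiving side: decode the packet and perform the access on local DRAM (a bytearray)."""
    op, _dest, addr, length = HEADER.unpack_from(packet)
    if op == OP_STORE:
        dram[addr:addr + length] = packet[HEADER.size:HEADER.size + length]
        return b""
    return bytes(dram[addr:addr + length])
```

A store followed by a load of the same address returns the stored bytes, which mirrors the round trip from the issuing core through the sending Messaging PM to the memory controller node and back.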

In one embodiment, the functional block diagram 1000 is implemented via components of the multi-protocol personality module 300 discussed above in reference to FIG. 3. The Messaging PM 1015 can be embodied by the Remote Interrupt Manager 303, the Remote Address translation module 305, the Bus Bridge 306 and the Remote Bus Processor 307. The cache coherent interconnect 1025 can be embodied by the bus fabric 301. The fabric 1010 can be implemented via one or more ports accessible to the fabric switch 308 for enabling access to the remote DRAM 1020.

In some embodiments of the present invention, the allocation of pooled memory (i.e., memory associated with one or more memory controller nodes) to individual compute nodes can be managed by a cluster-level memory manager. This memory manager can be a software entity that is a standalone management entity or that is tightly integrated into other cluster-level management entities such as, for example, a job scheduler, a power management entity, etc. The allocation of the remote memory that is mapped into address space of a compute node to applications running on that compute node can be managed by an operating system (OS) or a virtual memory manager (VMM) using known virtual memory management and memory allocation techniques. For example, the OS and/or VMM can employ non-uniform memory access (NUMA) memory allocation techniques to distinguish between allocation of local memory and remote memory.
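A cluster-level memory manager of the kind described can be sketched as a simple bookkeeping object. This is a hedged illustration only: the class `PooledMemoryManager`, its first-fit policy, and the controller and node names are assumptions made for the example and are not taken from the source.

```python
class PooledMemoryManager:
    """Hypothetical cluster-level manager tracking pooled memory per memory controller node."""

    def __init__(self, pool_bytes_per_controller):
        self.free = dict(pool_bytes_per_controller)   # controller -> free bytes
        self.allocations = {}                         # compute node -> [(controller, bytes)]

    def allocate(self, compute_node, nbytes):
        """First-fit: grant the request from the first controller with enough free memory."""
        for controller, free_bytes in self.free.items():
            if free_bytes >= nbytes:
                self.free[controller] -= nbytes
                self.allocations.setdefault(compute_node, []).append((controller, nbytes))
                return controller
        return None   # no single controller can satisfy the request

    def release(self, compute_node):
        """Return everything held by a compute node to the pool."""
        for controller, nbytes in self.allocations.pop(compute_node, []):
            self.free[controller] += nbytes
```

Releasing a node's allocation when its workload ends is what lets the same pooled memory be reassigned to a different compute node with different provisioning needs.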

In view of the disclosures made herein, a skilled person will recognize that embodiments of the present invention enable various mechanisms of pooled memory functionality to be implemented. Pooled memory functionality is a specific implementation of remote memory access functionality. Examples of these mechanisms of pooled memory functionality include, but are not limited to, remote memory being mapped to physical address space of a node, load/store access being carried out from a CPU of a node, get/put access from user space, and DMA memory content transactions from remote memory to local memory. The benefits of these mechanisms of pooled memory functionality include, but are not limited to, disaggregated memory that can be used across multiple SoC generations, compute nodes that can be assigned total memory based on workload characteristics, and get/put into remote memory that enables low-latency optimizations (e.g., via key/value stores, memcached, etc.).

The remote memory architecture embodied within the functional block diagram 1000 can support two primary styles of pooled memory functionality. A first one of these styles of pooled memory functionality relates to shared remote memory. A second one of these styles of pooled memory functionality relates to disaggregated private memory. These use cases differ in whether an allocated portion of the pooled memory (i.e., remote memory) is mapped into the address space of a compute node and in how the allocated portion of the pooled memory is accessed.

The style of pooled memory functionality relating to shared remote memory involves remote memory get/put operations. In this style of pooled memory functionality, processor initiated bus cycles (i.e., load/stores) would not be directly remoted across the fabric. Rather, very low-latency user-space proxies for direct load/stores would be provided. These remote memory accesses represent get/put and/or load/store operations.

In the case of pooled memory functionality relating to disaggregated private memory, as shown in FIG. 11, a physical address space 1105 of a particular compute node (e.g., a particular one of the compute nodes 805 shown in FIG. 8A) has local physical memory 1110 residing at its bottom portion and has the allocated remote memory (i.e., allocated remote memory 1115) mapped into its higher physical addresses. The allocated remote memory is not shared with any other nodes but is cacheable by management and OS cores of the node. Furthermore, the allocated remote memory is not directly accessible by user-space applications. In other words, accesses to allocated remote memory use physical addresses generated either by the paging mechanism implemented by the OS/VMM or by a memory management unit of the node's central processing unit. Accesses to allocated remote memory will typically have higher latencies compared to accesses to local memory. This is due at least in part to memory bandwidth of the allocated remote memory being constrained by bi-section bandwidth of the fabric interconnecting the compute nodes, which will likely be lower than the memory bandwidth of the local memory. Therefore, well-known memory hierarchy concepts such as caching and pre-fetching can be utilized for optimizing accesses to the allocated remote memory.
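The address space layout described for FIG. 11 (local memory at the bottom, allocated remote memory at higher physical addresses) implies a simple decode step for each physical address. The sketch below assumes hypothetical region sizes and a hypothetical function name; it illustrates the decode only and makes no claim about the actual address map.

```python
def classify(addr, local_size, remote_base, remote_size):
    """Decode a physical address into ("local" | "remote", offset within that region)."""
    if 0 <= addr < local_size:
        return ("local", addr)
    if remote_base <= addr < remote_base + remote_size:
        return ("remote", addr - remote_base)
    raise ValueError(f"address {addr:#x} is not mapped")
```

For example, with 4 GiB of local memory and an 8 GiB remote window based at 4 GiB, an address one byte past the local region decodes to offset 0 of the remote window and would be routed across the fabric.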

A primary goal of disaggregated private memory is to provide a fabric attached pool of memory (i.e., fabric attached pooled memory) that can be flexibly assigned to compute nodes. Native load/store transactions are supported over a fabric. Examples of these native load/store transactions include, but are not limited to, transactions associated with global fabric address space, transactions associated with compute nodes carrying out read/write operations to remote memory, and transactions associated with remote DMA of memory content into physical memory of a compute node. In implementing disaggregated private memory in accordance with embodiments of the present invention, compute nodes will have private memory (e.g., private mutable memory) and can share a pool of fabric accessible memory (e.g., cacheable, non-coherent shared memory). Furthermore, fabric pool memory configured in accordance with embodiments of the present invention can be implemented within a chassis or across a largest possible fabric (e.g., across one or more racks).

Implementations of disaggregated private memory as disclosed herein can be considered as a class of remote NUMA memory (i.e., one-sided cache coherent, which is also known as I/O coherent). For example, certain commercially available operating systems (e.g., Linux brand operating systems) have support for NUMA memory in the form of a NUMA subsystem. More specifically, Linux brand operating systems have NUMA awareness such as via numactl (e.g., control NUMA policy for processes or shared memory), libnuma (e.g., NUMA policy API), and enhanced topology detection. Additionally, malloc-type memory allocation functionality is configured to ensure that the regions of memory that are allocated to a process are as physically close as possible to the core on which the process is executing, which increases memory access speeds. A node cluster architecture configured in accordance with the present invention can be configured to integrate with such a NUMA subsystem for allowing kernel and applications to have control of memory locality without having to expose new APIs and malloc-type memory allocation functionality for increasing memory access speeds.

Implementations of disaggregated private memory as disclosed herein can utilize device controllers (e.g., memory device controllers) that are physically allocated to remote nodes. This type of implementation is exemplified herein in the discussion relating to Example 4 and FIG. 6. Utilizing device controllers that are physically allocated to remote nodes allows the centralization of memory controllers and memory devices on a set of nodes. For example, the memory controllers and memory devices can be allocated to remote nodes at run-time whereby drivers continue to run on nodes acting as servers, drivers directly access remote memory controllers (e.g., of memory controller nodes), and DMA/interrupts are implemented transparently over the fabric that interconnects the nodes.

Example 6: Memcached Server Revolution

FIG. 12 illustrates an embodiment of the present invention configured for providing memcached server functionality 1200. The memcached server functionality 1200 utilizes pooled memory disclosed herein in accordance with the present invention. Memcached server functionality in accordance with the present invention is applicable to a large class of key-value store storage. Advantageously, implementation of the memcached server functionality 1200 in accordance with the present invention allows memcached clients 1205A, 1205B (e.g., web servers) to reach back to a pooled memory 1210 (i.e., memcached memory pool) to get cached values of data without having to go back to their respective database tier. For example, the pooled memory 1210 can be implemented as NUMA fabric pooled memory. The memcached clients 1205A, 1205B can be embodied by one or more compute nodes that are each allocated respective private mutable memory 1215A, 1215B from pooled memory associated with one or more memory controller nodes. To this end, the memcached clients 1205, the pooled memory 1210, and the private mutable memory 1215 can be embodied by the pooled memory server apparatus discussed above in reference to FIGS. 8A-8C.

The memcached clients 1205A, 1205B each map access information (e.g., a key) directly and reach into the pooled memory 1210 to obtain the data with a direct memory load. In this manner, unlike the traditional memcached approach, there is no networking needed for accessing memcached data. Each one of the memcached servers 1210a-e hashes into local DRAM and returns the hashed value over TCP/IP or UDP, which serves as the communication protocol between the memcached servers and each one of the memcached clients 1205.

In regard to a specific example involving a cluster of SoC nodes (i.e., including Node A and Node B) that are interconnected by a node interconnect fabric, Node A (e.g., through web server functionality thereof) requests an account lookup for Account #100. The web server request goes through a memcached client API into a memcached client library with a cache data request for Key ID #100. The memcached client library hashes Key ID #100 to the memcached server that holds that data, whereby it hashes to Node B, which is providing memcached server functionality. The memcached client library determines that Node A and Node B have a remote memory capable fabric between them (e.g., are configured for providing remote memory access functionality in accordance with the present invention). The memcached client library on Node A performs a server-side hash of Key ID #100 and uses a remote memory access to Node B to determine if this data is currently encached and, if so, the memory address that contains the data. In the case where it is determined that the data is currently encached, the memcached client library on Node A directly accesses the remote cached data from Node B's memory address space (e.g., memory address space of Node B's memcached server functionality). The memcached client library then returns the data to the requesting web server on Node A.
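The lookup flow in this example can be sketched as a small in-process model. Everything named here is hypothetical: the `Node` class, the CRC32-based server selection, and the `network_get` fallback for node pairs without a remote memory channel are assumptions added for illustration, and the dictionaries merely stand in for remote memory reads.

```python
import zlib


class Node:
    """Hypothetical SoC node that may also provide memcached server functionality."""

    def __init__(self, name, rma_peers=()):
        self.name = name
        self.rma_peers = set(rma_peers)   # nodes reachable over a remote memory channel
        self.index = {}                   # key -> address in this node's cache memory
        self.memory = {}                  # address -> cached value

    def store(self, key, value):
        addr = len(self.memory)
        self.index[key] = addr
        self.memory[addr] = value


def pick_server(key, servers):
    """Client-side hash that selects which server holds the key."""
    return servers[zlib.crc32(key.encode()) % len(servers)]


def get(client, key, servers, network_get):
    server = pick_server(key, servers)
    if server.name in client.rma_peers:
        addr = server.index.get(key)      # models a remote read of the server's hash table
        if addr is None:
            return None                   # not currently cached
        return server.memory[addr]        # models a direct remote load of the cached data
    return network_get(server, key)       # assumed fallback: conventional network request
```

In this sketch, Node A's lookup for key "100" resolves to Node B and returns the cached value without invoking `network_get`, while a client with no remote memory channel to Node B takes the fallback path.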

Example 7: High Frequency Trading Backend

In support of high frequency trading, stock exchange tick data can stream as multicast packets at rates up to 6 MB/sec or more. The tick data can be highly augmented with derived data thereof. A fabric memory pool apparatus is used to store the tick data in one place, where it is accessed by a plurality of trading servers. Referring to the pooled memory server apparatus discussed above in reference to FIGS. 8A-8C, the fabric memory pool apparatus can be embodied in the form of the memory controller node chassis 821 and the trading servers can be embodied in the form of the compute node chassis 823. The tick data is only appended such that the tick data does not have to be multicast and replicated. Furthermore, all compute nodes of the trading servers get direct read-only shared access to the tick data (i.e., via pooled memory of the memory controller nodes) whereby the tick data is still CPU cacheable for frequently accessed data.
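The append-only, read-shared access pattern can be illustrated with a toy model in which one log is written once and each trading server keeps only its own read position. The class names and tick tuples are hypothetical, and an ordinary Python list stands in for the pooled memory.

```python
class TickLog:
    """Hypothetical append-only tick store held in one place (the memory pool)."""

    def __init__(self):
        self._ticks = []

    def append(self, tick):
        """Only appends are allowed; existing ticks are never rewritten."""
        self._ticks.append(tick)
        return len(self._ticks) - 1

    def read_from(self, position):
        """Read-only view of every tick at or after the given position."""
        return tuple(self._ticks[position:])


class TradingServer:
    """Reader that shares the single log and tracks only its own cursor."""

    def __init__(self, log):
        self._log = log
        self._position = 0

    def poll(self):
        new_ticks = self._log.read_from(self._position)
        self._position += len(new_ticks)
        return new_ticks
```

Because the log is never modified in place, each reader can advance independently and no per-server copy of the tick data is required.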

Example 8: Message Passing Interface Remote Memory Access (One Sided)

The underlying premise of message passing interface (MPI) remote memory access (RMA) is that any allocated memory is private to the MPI process by default. As needed, this allocated private memory can be exposed to other processes as a public memory region. To do this, an MPI process declares a segment of its memory to be part of a window, allowing other processes to access this memory segment using one-sided operations such as PUT, GET, ACCUMULATE, and others. Processes can control the visibility of data written using one-sided operations for other processes to access using several synchronization primitives. Referring to the pooled memory server apparatus discussed above in reference to FIGS. 8A-8C, memory of the MPI process can be embodied in the form of the memory controller node chassis 821.

MPI 3rd generation (i.e., MPI-3) RMA offers two new window allocation functions. The first new window allocation function is a collective version that can be used to allocate window memory for fast access. The second new window allocation function is a dynamic version which exposes no memory but allows the user to "register" remotely-accessible memory locally and dynamically at each process. Furthermore, new atomic operations, such as fetch-and-accumulate and compare-and-swap, offer new functions for well-known shared memory semantics and enable the implementation of lock-free algorithms in distributed memory.
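The semantics of these one-sided operations can be illustrated with a toy, single-process model of a window. This sketch does not use or imitate any MPI library API: the `Window` class and its method names are hypothetical, and the model ignores synchronization and concurrency entirely.

```python
class Window:
    """Toy model of a memory segment exposed for one-sided access (not an MPI API)."""

    def __init__(self, size):
        self._mem = [0] * size

    def _check(self, offset, count):
        if offset < 0 or offset + count > len(self._mem):
            raise IndexError("access outside the exposed window")

    def put(self, offset, values):
        self._check(offset, len(values))
        self._mem[offset:offset + len(values)] = values

    def get(self, offset, count):
        self._check(offset, count)
        return list(self._mem[offset:offset + count])

    def accumulate(self, offset, values):
        """Element-wise sum into the window."""
        self._check(offset, len(values))
        for i, v in enumerate(values):
            self._mem[offset + i] += v

    def fetch_and_accumulate(self, offset, value):
        """Add to one element and return its previous value."""
        self._check(offset, 1)
        old = self._mem[offset]
        self._mem[offset] += value
        return old

    def compare_and_swap(self, offset, expected, new):
        """Replace one element only if it equals `expected`; return the previous value."""
        self._check(offset, 1)
        old = self._mem[offset]
        if old == expected:
            self._mem[offset] = new
        return old
```

A compare-and-swap that returns the expected value tells the caller its update was applied, which is the building block for the lock-free algorithms mentioned above.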

Example 9: Partitioned Global Address Space Languages (PGAS)

Examples of common PGAS languages include, but are not limited to, Unified Parallel C, Co-Array Fortran, Titanium, X-10, and Chapel. As shown in FIG. 13A, in these PGAS languages, memory distributed over many compute nodes (i.e., distributed memory 1300) is seen as one global memory (i.e., global memory 1305) that can be accessed by all the processes without requiring explicit communication like in MPI. Hidden communication is based on one-sided communication. As shown in FIG. 13B, PGAS languages introduce the concept of a global memory space 1310 that is partitioned between the participating threads 1315 (e.g., ranks in MPI) with each process being able to access both local memory (e.g., distributed memory 1300 local to a particular compute node) and remote memory (e.g., distributed memory 1300 local to a different compute node than the particular compute node). Access to local memory is via standard sequential program mechanisms whereas access to remote memory is directly supported by the new features of the PGAS language and is usually done in a "single-sided" manner (unlike the double-sided manner of MPI). The single-sided programming model is more natural than the MPI alternative for some algorithms. In accordance with embodiments of the present invention, RDMA and remote memory functionalities allow efficient PGAS capability to be provided. Referring to the pooled memory server apparatus discussed above in reference to FIGS. 8A-8C, the global memory 1305 can be embodied in the form of the memory controller node chassis 821.
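The idea of a partitioned global memory space can be sketched with a block-partitioned array, where each global index resolves to an owning thread and a local offset. The class `GlobalArray` and the block distribution are assumptions made for illustration; real PGAS languages offer several distributions and perform the remote access through the language runtime.

```python
class GlobalArray:
    """Toy block-partitioned global array: one local partition per participating thread."""

    def __init__(self, threads, elems_per_thread):
        self.elems_per_thread = elems_per_thread
        self.parts = [[0] * elems_per_thread for _ in range(threads)]

    def owner_of(self, global_index):
        """Return (owning thread, offset within that thread's partition)."""
        return divmod(global_index, self.elems_per_thread)

    def read(self, global_index):
        thread, offset = self.owner_of(global_index)
        return self.parts[thread][offset]

    def write(self, global_index, value):
        thread, offset = self.owner_of(global_index)
        self.parts[thread][offset] = value

    def is_local(self, global_index, my_thread):
        """True when the access stays in the caller's own partition."""
        return self.owner_of(global_index)[0] == my_thread
```

The caller uses a single global index either way; whether the access is local or remote is decided by the partitioning, which is the property that makes the communication "hidden".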

Example 10: Disaggregated Server Resources

Currently, disaggregation of server resources is limited to separating compute resources (e.g., CPU and RAM) from storage via separate chassis that are connected via an interface such as, for example, PCIe or SAS. However, data centers and other types of server operating entities will benefit from disaggregation of CPU resources, storage resources, and memory resources. This will allow server operating entities to replace/update CPU resources, storage resources, and memory resources (i.e., server resources) at their respective lifecycle timeframe without having to replace/update one server resource at the particular lifecycle timeframe of another server resource. Advantageously, embodiments of the present invention can provide for such disaggregation of CPU resources, storage resources, and memory resources. In particular, embodiments of the present invention provide for the disaggregation of RAM (i.e., memory resources) from compute node cards (i.e., CPU resources) so that CPU resources can be replaced/updated as new CPU resources (e.g., processors) are released whereas memory resources (e.g., RAM, non-volatile storage, etc.) can remain in use as long as they are efficient and/or effectively functional. To this end, referring to the pooled memory server apparatus discussed above in reference to FIGS. 8a-8c, the memory resources can be embodied in the form of the memory controller node chassis 821 (i.e., a first physical enclosure unit), the CPU resources can be embodied in the form of the compute node chassis 823 (i.e., a second physical enclosure unit), and the storage resources can be embodied in the form of the storage controller node chassis (i.e., a third physical enclosure unit). Memory resources can be in the form of one or more HMCs.

Example 11: Hybrid Memory Cube (HMC) Deployed Near Memory Pool

As shown in FIG. 14A, pooled memory functionality in accordance with the present invention can be implemented in the form of HMC deployed near memory pools. In such an implementation, an HMC unit (i.e., pooled memory) is shared by a plurality of compute nodes 1410 (i.e., the shared HMC unit 1405). For example, the compute nodes 1410 can all be of a common compute node card such as the Calxeda brand EnergyCard. As shown, each one of the compute nodes 1410 can also have respective base memory 1415. In this manner, the compute nodes 1410 can have non-coherent, shared memory and, optionally, private mutable memory. Referring to the pooled memory server apparatus discussed above in reference to FIGS. 8a-8c, the compute nodes 1410 (i.e., CPU resources) can be embodied in the form of the compute node chassis 823 and the HMC unit 1405 can be embodied in the form of the memory controller node chassis 821. As a skilled person will appreciate, the near memory pools implemented with HMC units do not require a fabric for data communication. As shown in FIG. 14B, each one of the compute nodes 1410 has a respective private HMC 1420. The private HMC 1420 and the shared HMC 1405 each provide HMC links for supporting communication of data therebetween (e.g., 16-lane HMC link with 40 GB/sec link capacity). For example, the HMC units can each include a cache coherent interconnect 1425 (e.g., a fabric bus) having two memory ports (e.g., 25 GB/sec link capacity each) each coupled to a respective HMC controller 1430 by a bridge 1435. In view of the disclosures made herein, a skilled person will appreciate that the compute nodes 1410 can be SoC nodes that are interconnected to each other through a node interconnect fabric and that access to the memory resources of the HMC unit 1405 is made over a respective communication link of the HMC unit 1405 without traversing any communication channel of the node interconnect fabric.

Example 12: Far Memory Pool Using Pooled Memory Functionality

Pooled memory functionality in accordance with the present invention can be implemented in the form of far memory pools. In such an implementation, pooled memory is shared by a plurality of compute nodes such as those of a compute node chassis configured in accordance with the present invention. The shared memory can be in the form of cache coherent DDR or cache coherent HMC such as that of a memory controller chassis configured in accordance with the present invention. The shared memory is accessed via a fabric that interconnects the compute nodes. Preferably, but not necessarily, the compute nodes are all of a common compute node card such as the Calxeda brand EnergyCard. Each one of the compute nodes can also have respective base memory. In this manner, the compute nodes can have non-coherent, shared memory and, optionally, private mutable memory.

In summary, in view of the disclosures made herein, a skilled person will appreciate that a system on a chip (SOC) refers to integration of one or more processors, one or more memory controllers, and one or more I/O controllers onto a single silicon chip. Furthermore, in view of the disclosures made herein, the skilled person will also appreciate that a SOC configured in accordance with the present invention can be specifically implemented in a manner to provide functionalities definitive of a server. In such implementations, a SOC in accordance with the present invention can be referred to as a server on a chip. In view of the disclosures made herein, the skilled person will appreciate that a server on a chip configured in accordance with the present invention can include a server memory subsystem, server I/O controllers, and a server node interconnect. In one specific embodiment, this server on a chip will include a multi-core CPU, one or more memory controllers that support ECC, and one or more volume server I/O controllers that minimally include Ethernet and SATA controllers. The server on a chip can be structured as a plurality of interconnected subsystems, including a CPU subsystem, a peripherals subsystem, a system interconnect subsystem, and a management subsystem.

An exemplary embodiment of a server on a chip that is configured in accordance with the present invention is the ECX-1000 Series server on a chip offered by Calxeda Incorporated. The ECX-1000 Series server on a chip includes a SOC architecture that provides reduced power consumption and reduced space requirements. The ECX-1000 Series server on a chip is well suited for computing environments such as, for example, scalable analytics, web serving, media streaming, infrastructure, cloud computing and cloud storage. A node card configured in accordance with the present invention can include a node card substrate having a plurality of the ECX-1000 Series server on a chip instances (i.e., each a server on a chip unit) mounted on the node card substrate and connected to electrical circuitry of the node card substrate. An electrical connector of the node card enables communication of signals between the node card and one or more other instances of the node card.

The ECX-1000 Series server on a chip includes a CPU subsystem (i.e., a processor complex) that uses a plurality of ARM brand processing cores (e.g., four ARM Cortex brand processing cores), which offer the ability to seamlessly turn on-and-off up to several times per second. The CPU subsystem is implemented with server-class workloads in mind and comes with an ECC L2 cache to enhance performance and reduce energy consumption by reducing cache misses. Complementing the ARM brand processing cores is a host of high-performance server-class I/O controllers via standard interfaces such as SATA and PCI Express interfaces. Table 3 below shows a technical specification for a specific example of the ECX-1000 Series server on a chip.

TABLE 3
Example of ECX-1000 Series server on a chip technical specification

Processor Cores
1. Up to four ARM® Cortex™-A9 cores @ 1.1 to 1.4 GHz
2. NEON® technology extensions for multimedia and SIMD processing
3. Integrated FPU for floating point acceleration
4. Calxeda brand TrustZone® technology for enhanced security
5. Individual power domains per core to minimize overall power consumption

Cache
1. 32 KB L1 instruction cache per core
2. 32 KB L1 data cache per core
3. 4 MB shared L2 cache with ECC

Fabric Switch
1. Integrated 80 Gb (8 × 8) crossbar switch with through-traffic support
2. Five (5) 10 Gb external channels, three (3) 10 Gb internal channels
3. Configurable topology capable of connecting up to 4096 nodes
4. Dynamic Link Speed Control from 1 Gb to 10 Gb to minimize power and maximize performance
5. Network Proxy Support to maintain network presence even with node powered off

Management Engine
1. Separate embedded processor dedicated for systems management
2. Advanced power management with dynamic power capping
3. Dedicated Ethernet MAC for out-of-band communication
4. Supports IPMI 2.0 and DCMI management protocols
5. Remote console support via Serial-over-LAN (SoL)

Integrated Memory Controller
1. 72-bit DDR controller with ECC support
2. 32-bit physical memory addressing
3. Supports DDR3 (1.5 V) and DDR3L (1.35 V) at 800/1066/1333 MT/s
4. Single and dual rank support with mirroring

PCI Express
1. Four (4) integrated Gen2 PCIe controllers
2. One (1) integrated Gen1 PCIe controller
3. Support for up to two (2) PCIe x8 lanes
4. Support for up to four (4) PCIe x1, x2, or x4 lanes

Networking
1. Support 1 Gb and 10 Gb Ethernet
2. Up to five (5) XAUI 10 Gb ports
3. Up to six (6) 1 Gb SGMII ports (multiplexed w/XAUI ports)
4. Three (3) 10 Gb Ethernet MACs supporting IEEE 802.1Q VLANs, IPv4/6 checksum processing, and TCP/UDP/ICMP checksum offload
5. Support for shared or private management LAN

SATA Controllers
1. Support for up to five (5) SATA disks
2. Compliant with Serial ATA 2.0, AHCI Revision 1.3, and eSATA specifications
3. SATA 1.5 Gb/s and 3.0 Gb/s speeds supported

SD/eMMC Controller
1. Compliant with SD 3.0 Host and MMC 4.4 (eMMC) specifications
2. Supports 1 and 4-bit SD modes and 1/4/8-bit MMC modes
3. Read/write rates up to 832 Mbps for MMC and up to 416 Mbps for SD

System Integration Features
1. Three (3) I2C interfaces
2. Two (2) SPI (master) interfaces
3. Two (2) high-speed UART interfaces
4. 64 GPIO/Interrupt pins
5. JTAG debug port

FIG. 15 shows a SoC unit (i.e., SoC 2200) configured in accordance with an embodiment of the present invention. More specifically, the SoC 2200 is configured for implementing discovery functionalities as disclosed herein. The SoC 2200 can be utilized in a standalone manner. Alternatively, the SoC 2200 can be utilized in combination with a plurality of other SoCs on a node card such as, for example, with each one of the SoCs being associated with a respective node of the node card.

The SoC 2200 includes a node CPU subsystem 2202, a peripheral subsystem 2204, a system interconnect subsystem 2206, and a management subsystem 2208. In this regard, a SoC configured in accordance with the present invention can be logically divided into several subsystems. Each one of the subsystems includes a plurality of operation components therein that enable a particular one of the subsystems to provide functionality thereof. Furthermore, each one of these subsystems is preferably managed as an independent power domain.

The node CPU subsystem 2202 of SoC 2200 provides the core CPU functionality for the SoC, and runs the primary user operating system (e.g., Ubuntu Linux). The node CPU subsystem 2202 comprises a node CPU 2210, an L2 cache 2214, an L2 cache controller 2216, memory controller 2217, and main memory 2219. The node CPU 2210 includes 4 processing cores 2222 that share the L2 cache 2214. Preferably, the processing cores 2222 are each an ARM Cortex A9 brand processing core with an associated media processing engine (e.g., Neon brand processing engine) and each one of the processing cores 2222 can have independent L1 instruction cache and L1 data cache. Alternatively, each one of the processing cores can be a different brand of core that functions in a similar or substantially the same manner as an ARM Cortex A9 brand processing core. Each one of the processing cores 2222 and its respective L1 cache is in a separate power domain. Optionally, the media processing engine of each processing core 2222 can be in a separate power domain. Preferably, all of the processing cores 2222 within the node CPU subsystem 2202 run at the same speed or are stopped (e.g., idled, dormant or powered down).

The memory controller 2217 is coupled to the L2 cache 2214 and to a peripheral switch of the peripheral subsystem 2204. Preferably, the memory controller 2217 is configured to control a plurality of different types of main memory (e.g., DDR3, DDR3L, LPDDR2). An internal interface of the memory controller 2217 can include a core data port, a peripherals data port, a data port of a power management unit (PMU) portion of the management subsystem 2208, and an asynchronous 32-bit AHB slave port. The PMU data port is desirable to ensure isolation for some low power states. The asynchronous 32-bit AHB slave port is used to configure the memory controller 2217 and access its registers. The asynchronous 32-bit AHB slave port is attached to the PMU fabric and can be synchronous to the PMU fabric in a similar manner as the asynchronous interface is at this end. In one implementation, the memory controller 2217 is an AXI interface (i.e., an Advanced eXtensible Interface).

The peripheral subsystem 2204 of SoC 2200 has the primary responsibility of providing interfaces that enable information storage and transfer functionality. This information storage and transfer functionality includes information storage and transfer both within a given SoC Node and with SoC Nodes accessible by the given SoC Node. Examples of the information storage and transfer functionality include, but are not limited to, flash interface functionality, PCIe interface functionality, SATA interface functionality, and Ethernet interface functionality. The peripheral subsystem 2204 can also provide additional information storage and transfer functionality such as, for example, direct memory access (DMA) functionality. Each of these peripheral subsystem functionalities is provided by one or more respective controllers that interface to one or more corresponding storage media (i.e., storage media controllers).

The peripherals subsystem 2204 includes the peripheral switch and a plurality of peripheral controllers for providing the abovementioned information storage and transfer functionality. The peripheral switch can be implemented in the form of a High-Performance Matrix (HPM) that is a configurable auto-generated advanced microprocessor bus architecture 3 (i.e., AMBA protocol 3) bus subsystem based around a high-performance AXI cross-bar switch known as the AXI bus matrix, and extended by AMBA infrastructure components.

The peripherals subsystem 2204 includes flash controllers 2230 (i.e., a first type of peripheral controller). The flash controllers 2230 can provide support for any number of different flash memory configurations. A NAND flash controller such as that offered under the brand name Denali is an example of a suitable flash controller. Examples of flash media include MultiMediaCard (MMC) media, embedded MultiMediaCard (eMMC) media, Secure Digital (SD) media, SLC/MLC+ECC media, and the like. Memory is an example of media (i.e., storage media) and error correcting code (ECC) memory is an example of a type of memory to which the memory controller 2217 interfaces (e.g., main memory 2219).

The peripherals subsystem 2204 includes Ethernet MAC controllers 2232 (i.e., a second type of peripheral controller). Each Ethernet MAC controller 2232 can be of the universal 1Gig design configuration or the 10G design configuration. The universal 1Gig design configuration offers a preferred interface description. The Ethernet MAC controllers 2232 include a control register set and a DMA (i.e., an AXI master and an AXI slave). Additionally, the peripherals subsystem 2204 can include an AXI2 Ethernet controller 2233. The peripherals subsystem 2204 includes a DMA controller 2234 (i.e., a third type of peripheral controller). DMA functionality is useful only for fairly large transfers. Thus, because private memory of the management subsystem 2208 is relatively small, the assumption is that associated messages will be relatively small and can be handled by an interrupt process. If the management subsystem 2208 needs/wants a large data transfer, it can power up the whole system except the cores and then DMA is available. The peripherals subsystem 2204 includes a SATA controller 2236 (i.e., a fourth type of peripheral controller). The peripherals subsystem 2204 also includes PCIe controllers 2238. As will be discussed below in greater detail, a XAUI controller of the peripherals subsystem 2204 is provided for enabling interfacing with other CPU nodes (e.g., of a common node card).

The system interconnect subsystem 2206 is a packet switch that provides intra-node and inter-node packet connectivity to Ethernet and within a cluster of nodes (e.g., small clusters up through integration with heterogeneous large enterprise data centers). The system interconnect subsystem 2206 provides a high-speed interconnect fabric, providing a dramatic increase in bandwidth and reduction in latency compared to traditional servers connected via 1 Gb Ethernet to a top of rack switch. Furthermore, the system interconnect subsystem 2206 is configured to provide adaptive link width and speed to optimize power based upon utilization.

An underlying objective of the system interconnect subsystem 2206 is to support a scalable, power-optimized cluster fabric of server nodes. As such, the system interconnect subsystem 2206 has three primary functionalities. The first one of these functionalities is serving as a high-speed fabric upon which TCP/IP networking is built and upon which the operating system of the node CPU subsystem 2202 can provide transparent network access to associated network nodes and storage access to associated storage nodes. The second one of these functionalities is serving as a low-level messaging transport between associated nodes. The third one of these functionalities is serving as a transport for remote DMA between associated nodes.

The system interconnect subsystem 2206 can be connected to the node CPU subsystem 2202 and the management subsystem 2208 through a bus fabric (i.e., Ethernet AXIs) of the system interconnect subsystem 2206. An Ethernet interface of the system interconnect subsystem 2206 can be connected to peripheral interfaces (e.g., interfaces 2230, 2232, 2234, 2238) of the peripheral subsystem 2204. A fabric switch (i.e., a switch-mux) can be coupled between the XAUI link ports of the system interconnect subsystem 2206 and one or more MACs 2243 of the system interconnect subsystem 2206. The XAUI link ports and MACs (i.e., high-speed interconnect interfaces) enable the node that comprises the SoC 2200 to be connected to associated nodes each having their own SoC (e.g., identically configured SoCs).

The processor cores 2222 (i.e., A9 cores) of the node CPU subsystem 2202 and management processor 2270 (i.e., M3) of the management subsystem 2208 can address MACs (e.g., MAC 2243) of the system interconnect subsystem 2206. In certain embodiments, the processor cores 2222 of the node CPU subsystem 2202 will utilize a first MAC and second MAC and the management processor 2270 of the management subsystem 2208 will utilize a third MAC. To this end, MACs of the system interconnect subsystem 2206 can be configured specifically for their respective application.

The management subsystem 2208 is coupled directly to the node CPU subsystem 2202 and directly to the system interconnect subsystem 2206. An inter-processor communication (IPC) module (i.e., IPCM) of the management subsystem 2208, which includes IPC 2216, is coupled to the node CPU subsystem 2202, thereby directly coupling the management subsystem 2208 to the node CPU subsystem 2202. The management processor 2270 of the management subsystem 2208 is preferably, but not necessarily, an ARM Cortex brand M3 microprocessor. The management processor 2270 can have private ROM and private SRAM. The management processor 2270 can be coupled to shared peripherals and private peripherals of the management subsystem 2208. The private peripherals are only accessible by the management processor, whereas the shared peripherals are accessible by the management processor 2270 and each of the processing cores 2222. Instructions for implementing embodiments of the present invention (e.g., functionalities, processes and/or operations associated with remote memory access, pooled memory access, memcache, distributed memory, server resource disaggregation, and the like) can reside in non-transitory memory coupled to/allocated to the management processor 2270.

Additional capabilities arise because the management processor 2270 has visibility into all buses, peripherals, and controllers. It can directly access registers for statistics on all buses, memory controllers, network traffic, fabric links, and errors on all devices without disturbing the processing cores 2222 or even making the access known to them. This allows for billing use cases where statistics can be gathered securely by the management processor without having to consume core processing resources (e.g., the processing cores 2222) to gather them, and in a manner that cannot be altered by the processing cores 2222.

The management processor 2270 has a plurality of responsibilities within its respective node. One responsibility of the management processor 2270 is booting an operating system of the node CPU 2210. Another responsibility of the management processor 2270 is node power management. Accordingly, the management subsystem 2208 can also be considered to comprise a power management unit (PMU) for the node and thus, is sometimes referred to as such. As discussed below in greater detail, the management subsystem 2208 controls power states to various power domains of the SoC 2200 (e.g., to the processing cores 2222 by regulating clocks). The management subsystem 2208 is an “always-on” power domain. However, the management processor 2270 can turn off the clocks to the management processor 2270 and/or its private and/or shared peripherals to reduce the dynamic power. Another responsibility of the management processor 2270 is varying synchronized clocks of the node CPU subsystem 2202 (e.g., of the node CPU 2210 and a snoop control unit (SCU)). Another responsibility of the management processor 2270 is providing baseboard management control (BMC) and IPMI functionalities including console virtualization. Another responsibility of the management processor 2270 is providing router management. Another responsibility of the management processor 2270 is acting as proxy for the processing cores 2222 for interrupts and/or for network traffic. For example, a generalized interrupt controller (GIC) of the node CPU subsystem 2202 will cause interrupts intended to be received by a particular one of the processing cores 2222 to be reflected to the management processor 2270 for allowing the management processor 2270 to wake the particular one of the processing cores 2222 when an interrupt needs to be processed by the particular one of the processing cores that is sleeping, as will be discussed below in greater detail. Another responsibility of the management processor 2270 is controlling phase-locked loops (PLLs). A frequency is set in the PLL and it is monitored for lock. Once lock is achieved the output is enabled to the clock control unit (CCU). The CCU is then signaled to enable the function. The management processor 2270 is also responsible for selecting the dividers but the actual changeover will happen in a single cycle in hardware. Another responsibility of the management processor 2270 is controlling a configuration of a variable internal supply used to supply electrical power to the node CPU subsystem 2202. For example, a plurality of discrete power supplies (e.g., some being of different power supplying specification than others (e.g., some having different power capacity levels)) can be selectively activated and deactivated as necessary for meeting power requirements of the node CPU subsystem 2202 (e.g., based on power demands of the processing cores 2222, the SCU, and/or the controller of the L2 cache 2214). A separate power control mechanism (e.g., switch) can be used to control power supply to each of the processing cores 2222 and separately to the SCU. Another responsibility of the management processor 2270 is managing a real-time-clock (RTC) that exists on a shared peripheral bus of the management subsystem 2208. Another responsibility of the management processor 2270 is managing a watchdog timer on a private peripheral bus of the management subsystem 2208 to aid in recovery from catastrophic software failures. Still another responsibility of the management processor 2270 is managing an off-board EEPROM. The off-board EEPROM is used to store all or a portion of boot and node configuration information as well as all or a portion of IPMI statistics that require non-volatile storage. Each of these responsibilities of the management processor 2270 is an operational functionality managed by the management processor 2270. Accordingly, operational management functionality of each one of the subsystems refers to two or more of these responsibilities being managed by the management processor 2270.
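The PLL control sequence described above (set a frequency, monitor for lock, enable the output to the CCU, then signal the CCU to enable the function) can be sketched as follows. This is an illustrative model only: the classes Pll and Ccu, the function bring_up_clock, and the polling limit are hypothetical and do not describe actual hardware registers.

```python
class Pll:
    """Hypothetical PLL model that reports lock a fixed number of polls after programming."""

    def __init__(self, polls_until_lock=3):
        self._polls_until_lock = polls_until_lock
        self._remaining = None      # None until a frequency has been programmed
        self.frequency_hz = None
        self.output_enabled = False

    def set_frequency(self, hz):
        self.frequency_hz = hz
        self.output_enabled = False
        self._remaining = self._polls_until_lock

    def is_locked(self):
        if self._remaining is None:
            return False
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True


class Ccu:
    """Hypothetical clock control unit with a single enable flag."""

    def __init__(self):
        self.function_enabled = False

    def enable_function(self):
        self.function_enabled = True


def bring_up_clock(pll, ccu, hz, max_polls=100):
    """Program the PLL, wait for lock, then enable its output and signal the CCU."""
    pll.set_frequency(hz)
    for _ in range(max_polls):
        if pll.is_locked():
            pll.output_enabled = True   # output is enabled only after lock
            ccu.enable_function()       # CCU is signaled last
            return True
    return False                        # assumption: give up if lock is never observed
```

The ordering is the point of the sketch: the output is never enabled, and the CCU is never signaled, before lock has been observed.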

As shown in FIG. 16, software 3300 is provided on the management processor 2270. The management processor 2270 includes a plurality of application tasks 3302, an operating system (OS)/input-output (I/O) abstraction layer 3304, a real-time operating system (RTOS) 3306, and device drivers 3308 for the various devices. The operating system (OS)/input-output (I/O) abstraction layer 3304 is a software layer that resides between the application tasks 3302 and the real-time operating system (RTOS) 3306. The operating system (OS)/input-output (I/O) abstraction layer 3304 aids in porting acquired software into this environment. The OS abstraction portion of the operating system (OS)/input-output (I/O) abstraction layer 3304 provides POSIX-like message queues, semaphores and mutexes. The device abstraction portion of the operating system (OS)/input-output (I/O) abstraction layer 3304 provides a device-transparent open/close/read/write interface much like the POSIX equivalent for those devices used by ported software. The real-time operating system (RTOS) 3306 resides between the operating system (OS)/input-output (I/O) abstraction layer 3304 and the device drivers 3308.
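A device-transparent open/close/read/write interface of the kind described above can be sketched as an abstract base class with interchangeable device implementations. The names Device, LoopbackDevice, and echo_through are hypothetical and chosen only for illustration; an actual abstraction layer on an RTOS would typically be written in C.

```python
from abc import ABC, abstractmethod


class Device(ABC):
    """Hypothetical device-transparent interface: open/close/read/write."""

    @abstractmethod
    def open(self) -> None: ...

    @abstractmethod
    def close(self) -> None: ...

    @abstractmethod
    def read(self, size: int) -> bytes: ...

    @abstractmethod
    def write(self, data: bytes) -> int: ...


class LoopbackDevice(Device):
    """Illustrative device that returns whatever was written to it."""

    def __init__(self):
        self._buffer = bytearray()
        self._is_open = False

    def open(self) -> None:
        self._is_open = True

    def close(self) -> None:
        self._is_open = False

    def read(self, size: int) -> bytes:
        self._require_open()
        chunk = bytes(self._buffer[:size])
        del self._buffer[:size]
        return chunk

    def write(self, data: bytes) -> int:
        self._require_open()
        self._buffer.extend(data)
        return len(data)

    def _require_open(self) -> None:
        if not self._is_open:
            raise OSError("device is not open")


def echo_through(device: Device, payload: bytes) -> bytes:
    """Ported code sees only the Device interface, never the concrete driver."""
    device.open()
    try:
        device.write(payload)
        return device.read(len(payload))
    finally:
        device.close()
```

Because echo_through depends only on the abstract interface, the same ported code could run over a UART, flash, or Ethernet driver wrapped in the same four calls.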

The application tasks 3302 include, but are not limited to, a boot task 3310, a system management task 3312, a power management task 3314, a serial concentrator task 3316, a frame switch management task 3318 (sometimes called routing management), and a network proxy task 3320. The boot task 3310 provides the function of booting the processing cores 2222 and the management processor 2270. The system management task 3312 provides the function of integrated operation of the various subsystems of the SOC 2200. The power management task 3314 provides the function of managing power utilization of the various subsystems of the SOC 2200. The serial concentrator task 3316 provides the function of managing communication from the other application tasks to a system console. This console may be directly connected to the SOC node via a UART (i.e., a universal asynchronous receiver/transmitter) or it can be connected to another node in the system. The frame switch management task 3318 (sometimes called routing management) is responsible for configuring and managing routing network functionality. As discussed in greater detail below, the network proxy task 3320 maintains network presence of one or more of the processing cores 2222 while in a low-power sleep/hibernation state and intelligently wakes one or more of the processing cores 2222 when further processing is required.

Device drivers 3308 are provided for all of the devices that are controlled by the management processor 2270. Examples of the device drivers 3308 include, but are not limited to, an I2C driver 3322, a SMI driver 3324, a flash driver 3326 (e.g., NAND type storage media), a UART driver 3328, a watchdog timer (i.e., WDT) driver 3330, a general purpose input-output (i.e., GPIO) driver 3332, an Ethernet driver 3334, and an IPC driver 3336. In many cases, these drivers are implemented as simple function calls. In some cases where needed for software portability, however, a device-transparent open/close/read/write type I/O abstraction is provided on top of these functions.

In regard to boot processes, it is well known that multiple-stage boot loaders are often used, during which several programs of increasing complexity sequentially load one after the other in a process of chain loading. Advantageously, however, the node CPU 2210 only runs one boot loader before loading the operating system. The ability for the node CPU 2210 to only run one boot loader before loading the operating system is accomplished via the management processor 2270 preloading a boot loader image into main memory (e.g., DRAM) of the node CPU subsystem before releasing the node CPU 2210 from a reset state. More specifically, the SOC 2200 can be configured to use a unique boot process, which includes the management processor 2270 loading a suitable OS boot loader (e.g., U-Boot) into main memory, starting the node CPU 2210 main OS boot loader (e.g., UEFI or U-Boot), and then loading the OS. This eliminates the need for a boot ROM for the node CPU, a first stage boot loader for the node CPU, and dedicated SRAM for boot of the node CPU.
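The ordering constraint in this boot process (the boot loader image must already be in main memory before the node CPU leaves reset) can be sketched with a small model. The class NodeCpu, the function management_boot, and the dictionary used to represent main memory are hypothetical illustrations, not a description of actual firmware.

```python
class NodeCpu:
    """Hypothetical node CPU that starts executing from main memory when reset is released."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # assumption: a dict stands in for DRAM
        self.in_reset = True
        self.log = []

    def release_reset(self):
        # With no boot ROM or first-stage loader, the image must already be present.
        if "boot_loader" not in self.main_memory:
            raise RuntimeError("no boot loader image preloaded in main memory")
        self.in_reset = False
        self.log.append("run " + self.main_memory["boot_loader"])
        self.log.append("load OS")


def management_boot(cpu, image_name="U-Boot"):
    """Management processor side: preload the image, then release the node CPU."""
    cpu.main_memory["boot_loader"] = image_name
    cpu.release_reset()
    return cpu.log
```

In this model the node CPU runs exactly one boot loader before the OS is loaded, which mirrors the single-stage behavior described above.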

While the foregoing has been with reference to a particular embodiment of the invention, it will be appreciated by those skilled in the art that changes in this embodiment may be made without departing from the principles and spirit of the disclosure, the scope of which is defined by the appended claims.

1-20. (canceled)
 21. A method for use in a first computerized device, the first computerized device comprising a memory and communicative with a second computerized device via a data fabric, the method comprising: receiving a first data communication requiring at least one mapping between a physical address in the memory and an address space of the second computerized device; responsive to at least the first data communication, mapping the physical address in the memory to the address space of the second computerized device; and receiving a request from the second computerized device to access data stored in the physical address in the memory using the mapping.
 22. The method of claim 21, wherein the mapping of the physical address in the memory to the address space of the second computerized device comprises creating the mapping within an IOMMU (Input Output Memory Management Unit) of the first computerized device.
 23. The method of claim 22, wherein the creating the mapping within an IOMMU (Input Output Memory Management Unit) of the first computerized device comprises mapping at least one address window to the physical address.
 24. The method of claim 23, further comprising generating at least one base physical address associated with the window.
 25. The method of claim 24, further comprising causing provision of the at least one base physical address associated with the window to the second computerized device.
 26. The method of claim 21, wherein the mapping the physical address in the memory to the address space of the second computerized device comprises mapping at least one address window to the physical address.
 27. The method of claim 26, wherein the mapping the physical address in the memory to the address space of the second computerized device further comprises enabling access to the at least one address window by the second computerized device.
 28. The method of claim 27, wherein the enabling access to the at least one address window by the second computerized device comprises enabling read-write access by the second computerized device.
 29. The method of claim 27, wherein the enabling access to the at least one address window by the second computerized device comprises enabling read-only access by the second computerized device.
 30. The method of claim 21, wherein the receiving the first data communication comprises receiving a mapping request issued by the second computerized device, the mapping request issued by the second computerized device and transmitted to the first computerized device via at least the data fabric.
 31. The method of claim 21, further comprising receiving a second data communication issued by the second computerized device, the second data communication configured to determine whether the first computerized device has data corresponding to a prescribed identifier stored in the memory.
 32. A computerized device, the computerized device comprising: a data memory; a processor in data communication with the data memory; a data interface enabling data communication with at least one other computerized device; and computerized logic configured to: generate a first data communication configured to request at least one mapping between a physical address in a memory of a second computerized device and an address space of the computerized device; cause transmission of the first data communication to the second computerized device via at least a data fabric in data communication with the data interface; receive via the data interface a second data communication, the second data communication configured to provide data to the computerized device relating to the mapping of the physical address in the memory of the second computerized device to the address space of the computerized device; and generate a third data communication configured to request access to data stored in the physical address using at least a portion of the provided data relating to the mapping.
 33. The computerized device of claim 32, wherein at least the computerized device and the data fabric reside on a single SoC (system on chip) semiconductor die.
 34. The computerized device of claim 32, wherein at least the second computerized device and the data fabric reside on a single SoC (system on chip) semiconductor die.
 35. The computerized device of claim 32, wherein at least the computerized device, the second computerized device, and the data fabric reside on a common substrate.
 36. The computerized device of claim 32, wherein the generation of the first data communication configured to request at least one mapping between a physical address in a memory of a second computerized device and an address space of the computerized device is responsive to a data access request from at least one of (i) the processor, or (ii) a second processor in data communication with the computerized device.
 37. The computerized device of claim 32, wherein the provided data relating to the mapping of the physical address in the memory of the second computerized device to the address space of the computerized device comprises at least one base address relating to an address window.
 38. The computerized device of claim 32, further comprising an IOMMU (Input Output Memory Management Unit), and wherein the computerized logic is further configured to utilize the provided data relating to the mapping of the physical address in the memory of the second computerized device to the address space of the computerized device to modify at least one aspect of the IOMMU to support access to the physical address.
 39. The computerized device of claim 38, wherein the modification of the at least one aspect of the IOMMU to support access to the physical address comprises producing a mapping of at least one address within the data memory to the physical address.
 40. A method for use in a first computerized node, the first computerized node comprising a memory, a data interface, a host processor, and a second processor, the first computerized node configured for communication with a computerized device via a data fabric via at least the data interface, the method comprising: receiving, at the second processor, a first data communication issued by the host processor; based at least on the received first data communication, generating a second data communication configured to request at least one mapping between at least one address in a memory of the computerized device and an address space of the first computerized node; causing transmission of the second data communication to the computerized device via at least a data fabric in data communication with the data interface; receiving via the data interface a third data communication, the third data communication configured to provide data to the first computerized node relating to the mapping of the at least one address in the memory of the computerized device to the address space of the first computerized node; and generating a fourth data communication configured to provide at least a portion of the provided data relating to the mapping to the host processor.
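
By way of non-limiting illustration only, the owner-side flow recited in claims 21 and 26 to 29 (receive a mapping request, map an address window to a physical address, enable read-only or read-write access, then serve an access request through the mapping) may be sketched in software as follows. This is a toy model under stated assumptions, not the claimed implementation: the names `OwnerNode`, `Window`, `Access`, `handle_map_request`, `handle_read`, and `handle_write` are hypothetical, a Python dictionary stands in for an IOMMU table, the window base value is arbitrary, and only one window per remote device is kept.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ_ONLY = "ro"
    READ_WRITE = "rw"


@dataclass
class Window:
    base: int       # base address handed to the remote device
    phys: int       # local physical address the window maps to
    size: int       # window length in bytes
    access: Access  # access mode enabled for the remote device


class OwnerNode:
    """Hypothetical stand-in for the first computerized device."""

    def __init__(self, mem_size: int):
        self.memory = bytearray(mem_size)
        self.windows = {}               # remote id -> Window (toy "IOMMU" table)
        self._next_base = 0x1000_0000   # arbitrary starting base (assumption)

    def handle_map_request(self, remote_id, phys, size, access):
        """Map a window onto [phys, phys + size) and return its base address."""
        if phys < 0 or size <= 0 or phys + size > len(self.memory):
            raise ValueError("range outside local memory")
        win = Window(self._next_base, phys, size, access)
        self._next_base += size
        self.windows[remote_id] = win
        return win.base

    def _translate(self, remote_id, addr, length):
        """Translate a window address to a local physical address."""
        win = self.windows.get(remote_id)
        if win is None:
            raise PermissionError("no mapping for this device")
        offset = addr - win.base
        if offset < 0 or length < 0 or offset + length > win.size:
            raise PermissionError("access outside the mapped window")
        return win, win.phys + offset

    def handle_read(self, remote_id, addr, length):
        """Serve a read request; allowed for both access modes."""
        _, phys = self._translate(remote_id, addr, length)
        return bytes(self.memory[phys:phys + length])

    def handle_write(self, remote_id, addr, data):
        """Serve a write request; rejected for read-only windows."""
        win, phys = self._translate(remote_id, addr, len(data))
        if win.access is not Access.READ_WRITE:
            raise PermissionError("window is read-only")
        self.memory[phys:phys + len(data)] = data


node = OwnerNode(4096)
node.memory[256:261] = b"hello"
base = node.handle_map_request("node-b", phys=256, size=64,
                               access=Access.READ_ONLY)
print(node.handle_read("node-b", base, 5))  # prints b'hello'
```

In this sketch the returned `base` plays the role of the base address provided to the second computerized device, and the bounds check in `_translate` confines the remote device to the mapped window.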